In [1]:
%autosave 0
from IPython.core.display import HTML, display
display(HTML('<style>.container { width:100%; } </style>'))
Autosave disabled
The process of creating a spam detector using the naive Bayes algorithm is split up into four steps.
We need the module `os` for reading directories and the module `re` for regular expressions.
In [2]:
import os
import re
import numpy as np
import math
An object of class `Counter` is a special form of a dictionary
that is used for counting. We need a counter to figure out what the most common words are.
In [3]:
from collections import Counter
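As a small illustration of how a `Counter` behaves, the following sketch counts the words in a made-up list and asks for the two most frequent ones. The word list is invented for this example and is not part of the email data.

```python
from collections import Counter

# Hypothetical word list, invented purely for illustration.
words = ['free', 'money', 'free', 'paper', 'free', 'money']

counter = Counter(words)          # counts how often each item occurs
top_two = counter.most_common(2)  # (word, count) pairs, most frequent first
print(top_two)
```

Looking up a word that was never counted, for example `counter['unknown']`, returns `0` instead of raising a `KeyError`.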
The directory https://github.com/karlstroetmann/Artificial-Intelligence/tree/master/Python/EmailData contains 960 emails that are divided into four subdirectories:
- `spam-train` contains 350 spam emails for training,
- `ham-train` contains 350 non-spam emails for training,
- `spam-test` contains 130 spam emails for testing,
- `ham-test` contains 130 non-spam emails for testing.

This data was originally collected by Ion Androutsopoulos. I found it on the page http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html provided by Andrew Ng.
We declare some variables so this notebook can be adapted to other data sets.
In [4]:
spam_dir_train = 'EmailData/spam-train/'
ham__dir_train = 'EmailData/ham-train/'
spam_dir_test = 'EmailData/spam-test/'
ham__dir_test = 'EmailData/ham-test/'
Directories = [spam_dir_train, ham__dir_train, spam_dir_test, ham__dir_test]
In order to compute the prior probability that an email is ham or spam we need to count the number of spam and ham emails.
In [5]:
no_spam = len(os.listdir(spam_dir_train))
no_ham = len(os.listdir(ham__dir_train))
spam_prior = no_spam / (no_spam + no_ham)
ham__prior = no_ham / (no_spam + no_ham)
spam_prior, ham__prior
Out[5]:
(0.5, 0.5)
I have checked that the proportion of spam and ham emails in the test directory is also $1:1$. If the proportion of spam and ham emails in real life differs from $1:1$, then we would have to use the real proportion as the prior in the spam filter to be developed.
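To illustrate the remark about proportions, here is a minimal sketch of how the priors would come out for an unbalanced collection. The function name `priors` and the counts 300 and 700 are hypothetical and do not come from the data set.

```python
def priors(no_spam, no_ham):
    """Return (spam prior, ham prior) as relative frequencies."""
    total = no_spam + no_ham
    return no_spam / total, no_ham / total

# Hypothetical, unbalanced counts chosen only for illustration.
p_spam, p_ham = priors(300, 700)
print(p_spam, p_ham)
```

With the balanced training data used here, the same computation yields `(0.5, 0.5)`.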
The function $\texttt{get_words}(\texttt{fn})$ takes a filename $\texttt{fn}$ as its argument. It reads the file and returns a set of all words that are found in this file. The words are transformed to lower case.
In [6]:
def get_words(fn):
    # The with-statement closes the file even if reading fails.
    with open(fn) as file:
        text = file.read()
    text = text.lower()
    return set(re.findall(r"[\w']+", text))
Let us test this function with a small example mail.
In [7]:
get_words('EmailData/ham-train/3-380msg4.txt')
Out[7]:
{'anyone',
'article',
'berkeley',
'book',
'consonant',
'edu',
'english',
'garnet',
'hard',
'helpful',
'hi',
'interest',
'irish',
'laurel',
'm',
'modern',
'palatal',
'phonetics',
'posting',
'project',
'recommend',
'slender',
'source',
'specifically',
'sutton',
'thank',
'too',
'work'}
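The regular expression `[\w']+` used in `get_words` matches maximal runs of word characters and apostrophes; note that `\w` also matches digits and the underscore. A minimal sketch on a made-up string (not taken from the email data) shows the effect:

```python
import re

# Made-up example text, not taken from the email data.
text = "Hi, I'm interested in Irish phonetics. Thank you!"

# Tokens in order of appearance; punctuation and spaces act as separators.
tokens = re.findall(r"[\w']+", text.lower())
print(tokens)

# Converting to a set removes duplicates; the order is not preserved.
print(set(tokens))
```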
The function `read_all_files` reads all files contained in those directories that are stored in the list `Directories`. It returns a `Counter`. For every word $w$ this counter contains the number of files that contain $w$.
In [8]:
def read_all_files():
    Words = Counter()
    for directory in Directories:
        for file_name in os.listdir(directory):
            Words.update(get_words(directory + file_name))
    return Words
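Because `get_words` returns a set, each call to `update` increases the count of any word by at most one, so the counter records in how many files a word occurs rather than how often it occurs in total. The following sketch contrasts the two ways of counting, using two invented documents that are given directly as word lists.

```python
from collections import Counter

# Two invented "documents", given directly as word lists.
docs = [['free', 'free', 'money'], ['free', 'paper']]

doc_freq  = Counter()  # number of documents containing each word
term_freq = Counter()  # total number of occurrences of each word
for doc in docs:
    doc_freq.update(set(doc))  # a set contributes each word once
    term_freq.update(doc)      # a list contributes every occurrence

print(doc_freq['free'], term_freq['free'])
```

The word `'free'` occurs three times in total but only in two documents.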
`Common_Words` is the set of the 2500 most common words found in all of our emails.
In [9]:
N = 2500 # number of the most common words to use
Word_Counter = read_all_files()
Word_Counter
Out[9]:
Counter({'eminent': 9,
'earn': 69,
'experience': 123,
'through': 155,
'phd': 22,
'prestige': 9,
'increase': 69,
'grant': 23,
'effort': 75,
'mba': 8,
'choice': 51,
'here': 259,
'short': 86,
'field': 117,
'part': 131,
'personal': 102,
'programs': 21,
'base': 134,
'ba': 13,
'phone': 202,
'power': 52,
'necessary': 55,
'degree': 41,
'further': 154,
'detail': 143,
'call': 347,
'advance': 81,
'require': 131,
'nonaccredit': 8,
'award': 20,
'present': 142,
'knowledge': 72,
'money': 187,
'university': 307,
'diploma': 10,
'ma': 37,
'cost': 147,
'entire': 45,
'conference': 138,
'grab': 9,
'week': 173,
'receive': 283,
'start': 173,
'leverage': 5,
'offence': 4,
'our': 365,
'delete': 59,
'po': 53,
'old': 83,
'mailer': 20,
'financial': 70,
'member': 104,
'problem': 128,
'believe': 103,
'ago': 65,
'throw': 20,
'customer': 69,
'hello': 54,
'letter': 106,
'inexpensive': 24,
'guarantee': 100,
'ignore': 42,
'complete': 119,
'control': 53,
'outside': 43,
'cash': 91,
'name': 289,
'usa': 122,
'state': 220,
'pardon': 9,
'texa': 35,
'cst': 5,
'reside': 3,
'send': 360,
'lifeline': 1,
'later': 81,
'without': 122,
'print': 107,
'program': 226,
'honestly': 6,
'best': 206,
'nobrainer': 1,
'one': 404,
'note': 148,
'free': 302,
'show': 161,
'computer': 152,
'credit': 103,
'registration': 86,
'must': 181,
'grapevine': 1,
'process': 161,
'center': 60,
'today': 179,
'weekly': 35,
'mind': 62,
'zip': 75,
'interest': 283,
'compound': 12,
'few': 128,
'address': 379,
'simple': 111,
'telephone': 91,
'educational': 22,
'main': 72,
'worth': 48,
'entitle': 13,
'convert': 12,
'plan': 88,
's': 560,
'message': 189,
'join': 95,
'number': 248,
'respond': 45,
'box': 124,
'achieve': 42,
'card': 112,
'life': 99,
'solution': 28,
'mortgage': 18,
'please': 445,
'city': 120,
'information': 448,
'especially': 74,
'net': 100,
'id': 34,
'participate': 63,
'us': 308,
'pull': 8,
'independence': 14,
'tuesday': 21,
'enable': 26,
'company': 139,
'over': 250,
'simply': 123,
'night': 39,
'pm': 42,
'finances': 2,
'intrusion': 18,
'return': 103,
'solid': 15,
'establish': 35,
'mean': 81,
'freedom': 47,
'peace': 7,
'form': 210,
'begin': 69,
'system': 171,
'debt': 40,
'obtain': 41,
'secure': 32,
'per': 141,
'pack': 15,
'cozy': 1,
'oct': 6,
'vacation': 37,
'west': 26,
'archery': 1,
'felton': 1,
'pay': 149,
'e': 294,
'home': 161,
'accomodation': 10,
'virginium': 9,
'turkey': 3,
'deer': 1,
'loader': 6,
'wonderful': 12,
'sesson': 1,
'cook': 3,
'economical': 3,
'meal': 6,
'buck': 12,
'room': 55,
'mail': 350,
'reserve': 28,
'stay': 32,
'noon': 5,
'nov': 3,
'muzzel': 1,
'hunt': 3,
'season': 10,
'announce': 71,
'want': 231,
'follow': 320,
'space': 44,
'wood': 3,
'com': 257,
'compuserve': 26,
'day': 244,
'dec': 7,
'wild': 7,
'lunch': 44,
'book': 145,
'camp': 6,
'three': 103,
'doe': 29,
'additional': 110,
'million': 111,
'wi': 1,
'reach': 72,
'commercial': 37,
'info': 71,
'future': 116,
'success': 82,
'nettool': 1,
'fingertip': 9,
'internet': 188,
'software': 119,
'network': 43,
'search': 88,
'permanently': 12,
'area': 161,
'evaluation': 39,
'proper': 23,
'requirement': 43,
'presence': 26,
'section': 74,
'stop': 75,
'regard': 60,
'propose': 46,
'web': 211,
'advantage': 67,
'sender': 27,
'certain': 47,
'help': 164,
'remove': 203,
'storefront': 2,
'target': 36,
'product': 137,
'fellow': 22,
'promote': 38,
'luck': 34,
'basis': 64,
'request': 157,
'loc': 2,
'comply': 24,
'recent': 65,
'lead': 63,
'mailing': 71,
'bill': 84,
'selection': 38,
'c': 174,
'ooo': 1,
'waterford': 1,
'reply': 131,
'ten': 35,
'paragraph': 13,
'post': 113,
'unite': 61,
'transmission': 13,
'gov': 30,
'http': 399,
'entrepreneur': 14,
'subject': 192,
'tool': 70,
'service': 171,
'dear': 70,
'business': 164,
'assist': 24,
'level': 107,
'need': 250,
'sale': 74,
'thoma': 16,
'item': 40,
'unbelievable': 9,
'much': 190,
'try': 125,
'set': 105,
'wish': 142,
'thank': 182,
'market': 156,
'email': 429,
'vast': 5,
'online': 126,
'venture': 7,
'federal': 36,
'audience': 16,
'unwise': 1,
'check': 210,
'greatest': 35,
'unmissable': 1,
're': 198,
'titanictesco': 1,
'park': 32,
'fame': 1,
'onto': 10,
'release': 45,
'include': 354,
'player': 13,
'visit': 138,
'ultimate': 15,
'refreshment': 1,
'stack': 7,
'gossip': 6,
'shop': 35,
'while': 116,
'chart': 10,
'cd': 63,
'never': 120,
'unlikely': 4,
'package': 73,
'alway': 91,
'www': 296,
'undead': 1,
'band': 12,
'why': 118,
'billy': 2,
'event': 50,
'full': 143,
'right': 144,
'digital': 33,
'delay': 16,
'yourself': 90,
'late': 30,
'friend': 90,
'easy': 125,
'available': 254,
'beautiful': 32,
'placebo': 3,
'chance': 63,
'fantastic': 35,
'top': 77,
'pick': 44,
'mtv': 1,
'glamour': 2,
'run': 81,
'access': 98,
'john': 96,
'competition': 40,
'click': 135,
'offer': 229,
'compaq': 5,
'n': 78,
'pop': 22,
'roll': 24,
'scoop': 6,
'dizzy': 1,
'premiere': 4,
'big': 63,
'sound': 76,
'bathtub': 2,
'reporter': 6,
'crash': 7,
'witch': 7,
'radio': 28,
'tesco': 1,
'portrait': 1,
'drink': 6,
'milan': 4,
'down': 96,
'atmosphere': 4,
'play': 64,
'provide': 203,
'london': 40,
'thing': 109,
'aqua': 3,
'crazy': 6,
'fun': 66,
'tale': 3,
'site': 218,
'record': 65,
'spellbind': 1,
'prepare': 44,
'nt': 222,
'true': 78,
'leicester': 3,
'unsubscribe': 23,
'glitz': 2,
'b': 118,
'technology': 84,
'xpack': 1,
'robbie': 7,
'emma': 1,
'fizzy': 1,
'rem': 4,
'icon': 6,
'miss': 66,
'exclusive': 39,
'capitalfm': 23,
'hit': 54,
'spook': 1,
'thursday': 26,
'save': 108,
'straight': 13,
'choose': 75,
'question': 221,
'rock': 11,
'star': 27,
'music': 29,
'europe': 41,
'halloween': 1,
'bumper': 1,
'hesitate': 39,
'accelerate': 5,
'graphic': 29,
'storm': 10,
'horror': 1,
'instant': 16,
'supply': 23,
'special': 162,
'spin': 6,
'prizewin': 1,
'll': 147,
'regular': 41,
'hurry': 12,
'many': 244,
'even': 194,
'colors': 1,
'reveal': 18,
'celine': 3,
'ghost': 1,
'too': 85,
'attend': 30,
've': 124,
'website': 81,
'starstud': 1,
'travolta': 1,
'foyer': 2,
'adulterous': 1,
'list': 329,
'classic': 12,
'absolutely': 50,
'south': 45,
'enter': 74,
'latest': 57,
'doorstep': 5,
'pc': 37,
'prize': 25,
'label': 21,
'roundup': 3,
'connolly': 1,
'dion': 3,
'tell': 130,
'megastar': 1,
'fill': 88,
'desktop': 7,
'presario': 4,
'dolby': 4,
'nail': 3,
'win': 120,
'paradise': 14,
'stock': 31,
'thompson': 10,
'scary': 7,
'titanic': 1,
'couple': 38,
'guess': 17,
'discount': 30,
'flick': 1,
'u': 123,
'entirely': 11,
'amaze': 52,
'link': 96,
'advertisement': 54,
'better': 106,
'william': 34,
'feel': 62,
'become': 90,
'spooky': 1,
'album': 16,
'game': 46,
'still': 99,
'manufacturer': 13,
'buy': 111,
'primary': 23,
'bring': 102,
'screen': 25,
'president': 18,
'biz': 10,
'coolest': 3,
'surround': 12,
'poster': 24,
'everythe': 21,
'fm': 15,
'focus': 78,
'talk': 92,
'team': 32,
'jimmy': 2,
'mailbox': 27,
'cdparadise': 4,
'next': 121,
'catch': 18,
'favourite': 13,
'world': 183,
'saint': 5,
'laugh': 19,
'up': 1,
'whether': 85,
'performance': 22,
'bunch': 6,
'hot': 45,
'bath': 3,
'head': 44,
'fantasy': 7,
'square': 6,
'capital': 58,
'movie': 32,
'major': 123,
'submission': 95,
'hrs': 3,
'resubmit': 2,
'meta': 3,
'automatically': 39,
'report': 129,
'calle': 4,
'notice': 40,
'engine': 50,
'compose': 6,
'fees': 8,
'within': 176,
'advertiser': 16,
'bulk': 79,
'after': 163,
'each': 201,
'etc': 150,
'every': 171,
'appropriate': 33,
'page': 181,
'toll': 50,
'monthly': 39,
'pro': 19,
'hr': 18,
'extractor': 14,
'block': 30,
'month': 131,
'review': 87,
'trie': 4,
'submit': 104,
'media': 16,
'tag': 11,
'thousands': 12,
'solve': 12,
'helps': 1,
'reg': 15,
'dollar': 98,
'something': 67,
'gotta': 5,
'wasus': 2,
'spam': 31,
'safeaddress': 1,
'idc': 2,
'discreet': 5,
'powerful': 39,
'quickly': 37,
'exceptions': 4,
'community': 45,
't': 146,
'high': 88,
'literally': 13,
'general': 107,
'along': 78,
'travel': 71,
'ask': 137,
'benefit': 38,
'oversea': 14,
'paper': 176,
'finance': 17,
'soundest': 4,
'promise': 37,
'legally': 14,
'amount': 101,
'extract': 29,
'clearly': 36,
'confirm': 27,
'certainly': 25,
'espouse': 5,
'upon': 51,
'contract': 19,
'beverly': 2,
'word': 191,
'extra': 64,
'nbc': 4,
'thousand': 99,
'means': 50,
'curency': 2,
'work': 304,
'soon': 95,
'monitor': 13,
'before': 190,
'ending': 2,
'themselve': 51,
'transact': 4,
'vary': 24,
'tran': 7,
'march': 62,
'transaction': 9,
'move': 71,
'ca': 111,
'under': 109,
'exactly': 65,
'kid': 23,
'public': 41,
'bl': 4,
'nightly': 7,
'view': 66,
'greatly': 16,
'earlier': 29,
'contact': 203,
'likewise': 13,
'currency': 17,
'minute': 106,
'wall': 17,
'create': 89,
'reason': 77,
'daily': 40,
'yet': 50,
'effect': 46,
'editorial': 15,
'santa': 14,
'optional': 26,
'conversion': 7,
'flaw': 6,
'back': 139,
'completely': 65,
'end': 105,
'amass': 4,
'individual': 75,
'operate': 37,
'organization': 50,
'however': 97,
'watch': 64,
'someone': 86,
'rate': 101,
'iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiius': 2,
'wealth': 16,
'fortune': 26,
'own': 170,
'wealthiest': 6,
'cartel': 4,
'explosive': 8,
'political': 23,
'membership': 30,
'corner': 14,
'national': 53,
'change': 137,
'hemisphere': 4,
'payable': 59,
'mllionaire': 2,
'attache': 2,
'dollars': 21,
'write': 160,
'o': 120,
'overnight': 34,
'anniversarry': 2,
'let': 121,
'group': 104,
'first': 274,
'assure': 16,
'rumble': 4,
'profile': 14,
'same': 154,
'attention': 60,
'iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiius': 2,
'publication': 87,
'continue': 64,
'postage': 23,
'else': 87,
'gold': 28,
'instruction': 102,
'nor': 42,
'cold': 9,
'm': 209,
'int': 6,
'fee': 94,
'most': 232,
'date': 129,
'different': 161,
'announcement': 47,
'concern': 64,
'glad': 15,
'unlike': 16,
'earth': 33,
'guise': 6,
'able': 99,
'parent': 10,
'easily': 63,
'anyone': 130,
'add': 122,
'york': 68,
'depend': 39,
'long': 83,
'ourself': 2,
'allow': 112,
'action': 61,
'pertinent': 5,
'below': 213,
'street': 65,
'exist': 63,
'operation': 23,
'legal': 67,
'advice': 18,
'monica': 4,
'extremely': 47,
'disclose': 6,
'leave': 103,
'cancel': 13,
'important': 110,
'californium': 52,
'lessly': 2,
'refund': 34,
'american': 88,
'uniform': 7,
'document': 38,
'confidential': 16,
'supporter': 6,
'hand': 89,
'read': 167,
'conclude': 22,
'reiterate': 4,
'keep': 120,
'grow': 52,
'until': 86,
'surely': 28,
'hi': 34,
'secret': 55,
'global': 33,
'unlimit': 34,
'administrative': 8,
'profit': 63,
'enquiry': 13,
'divulge': 4,
'don': 59,
'great': 141,
'line': 156,
'learn': 124,
'ship': 60,
'immediately': 84,
'those': 183,
'instruct': 21,
'limited': 23,
'ourselve': 14,
'worldwide': 44,
'iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiius': 2,
'excerpt': 8,
'purpose': 40,
'source': 67,
'plus': 109,
'again': 128,
'office': 95,
'left': 3,
'school': 65,
'low': 51,
'hundred': 66,
'envy': 4,
'total': 79,
'hills': 2,
'd': 212,
'recently': 56,
'second': 117,
'suite': 58,
'exchange': 43,
'share': 84,
'method': 99,
'fluctuate': 5,
'differential': 6,
'around': 67,
'britney': 4,
'tomorrow': 9,
'tip': 29,
'noisy': 2,
'listen': 29,
'loca': 1,
'sneak': 5,
'excite': 61,
'peek': 6,
'everybody': 16,
'teacher': 25,
'ride': 4,
'beer': 11,
'smash': 3,
'calm': 2,
'vonda': 1,
'answer': 95,
'gerus': 2,
'century': 23,
'ever': 120,
'stereo': 1,
'chat': 21,
'californication': 1,
'universal': 33,
'channel': 19,
'globe': 10,
'zone': 11,
'hottest': 30,
'chilus': 1,
'uk': 97,
'celluloid': 1,
'red': 11,
'tvchannel': 3,
'lines': 8,
'song': 15,
'wait': 71,
'singles': 7,
'fave': 1,
'halliwell': 2,
'past': 80,
'terminator': 1,
'whispering': 1,
'tv': 26,
'compzone': 10,
'preacher': 3,
'june': 60,
'eurovision': 2,
'man': 46,
'forthcome': 12,
'rd': 59,
'piece': 42,
'centrepiece': 1,
'goss': 1,
'playstation': 5,
'break': 89,
'newradioworld': 3,
'bag': 10,
'ricky': 2,
'stress': 20,
'fabulous': 21,
'manic': 3,
'itch': 1,
'diary': 2,
'backstreet': 2,
'live': 114,
'highlight': 13,
'examiner': 1,
'ad': 74,
'recognise': 3,
'entertain': 7,
'martin': 21,
'angele': 14,
'studio': 8,
'beverage': 10,
'gear': 4,
'st': 90,
'saturday': 44,
'dreamcast': 1,
'somethe': 4,
'webchat': 2,
'co': 53,
'delivery': 52,
'summer': 30,
'both': 176,
'weekend': 20,
'ring': 8,
'th': 156,
'despatch': 4,
'foursome': 1,
'preparation': 12,
'madonna': 3,
'panic': 2,
'where': 198,
'taylor': 8,
'till': 4,
'la': 42,
'girls': 5,
'professor': 46,
'baz': 3,
'livin': 1,
'winning': 6,
'luhrmann': 3,
'revisionline': 1,
'holiday': 25,
'meet': 98,
'lyric': 10,
'shepard': 1,
'boyzone': 6,
'revision': 3,
'video': 73,
'size': 37,
'nd': 59,
'bargain': 16,
'goodie': 2,
'precious': 3,
'vote': 15,
'braless': 1,
'prof': 34,
'ticket': 37,
'feature': 90,
'prior': 37,
'rubber': 1,
'carefully': 29,
'really': 106,
'order': 271,
'vida': 1,
'pepper': 1,
'ball': 7,
'film': 19,
'musical': 11,
'wednesday': 23,
'expensive': 22,
'millennium': 8,
'boy': 35,
'schizophonic': 2,
'winner': 25,
'lot': 85,
'cinema': 14,
'margherita': 2,
'gadget': 2,
'title': 132,
'sega': 1,
'comp': 9,
'lo': 19,
'everyone': 74,
'kit': 12,
'mark': 66,
'character': 21,
'preorder': 7,
'price': 139,
'love': 55,
'hits': 6,
'jeni': 1,
'girl': 36,
'xxx': 38,
'teen': 19,
'trial': 33,
'tempt': 4,
'index': 56,
'adult': 70,
'tantalize': 3,
'forbidden': 2,
'html': 129,
'mci': 17,
'z': 13,
'ones': 4,
'shortest': 1,
'range': 57,
'kevin': 6,
'cyberpromo': 3,
'several': 109,
'blank': 17,
'familiar': 20,
'numerous': 22,
'duplicate': 31,
'circle': 10,
'finish': 24,
'opportunity': 116,
'extension': 23,
'international': 141,
'user': 53,
'gigabyte': 1,
'contain': 98,
'newer': 2,
'responsive': 7,
'broad': 18,
'possible': 131,
'profanity': 9,
'fund': 42,
'bank': 81,
'randomly': 4,
'seal': 5,
'offers': 6,
'almost': 50,
'off': 107,
'ours': 17,
'canada': 55,
'cause': 28,
'released': 15,
'teaser': 1,
'newsgroup': 15,
'risk': 50,
'cleanest': 16,
'postings': 1,
'vulgarity': 10,
'cut': 43,
'mine': 30,
'highly': 37,
'alberta': 2,
'sort': 36,
'fax': 247,
'filter': 31,
'produce': 63,
'seeker': 3,
'wrap': 23,
'ups': 2,
'dure': 13,
'download': 43,
'dupe': 12,
'kick': 7,
'undeliverable': 25,
'sell': 108,
'finally': 51,
'fedex': 5,
'unique': 42,
'real': 90,
'anon': 10,
'nobody': 12,
'fold': 8,
'generate': 67,
'private': 30,
'nospam': 1,
'mil': 15,
'bonus': 49,
'enclose': 43,
'monrose': 1,
'mlmer': 1,
'type': 183,
'nondeliverable': 1,
'key': 40,
'actually': 54,
'unless': 27,
'adam': 8,
...})
In [10]:
Common_Words = { w for w, _ in Word_Counter.most_common(N) }
Common_Words
Out[10]:
{'load',
'comprise',
'familiar',
'teen',
'massive',
'gamble',
'none',
'implementation',
'majority',
'cgibin',
'rejection',
'smaller',
'launch',
'lee',
'database',
'food',
'window',
'transfer',
'candidate',
'delay',
'frank',
'multus',
'late',
'engage',
'work',
'government',
'newest',
'call',
'vulgarity',
'material',
'organisation',
'affect',
'hundreds',
'propose',
'john',
'campus',
'competition',
'view',
'penny',
'currency',
'gender',
'class',
'santa',
'andrew',
'bonus',
'refinance',
'organization',
'eric',
'site',
'ongo',
'italy',
'sprachwissenschaft',
'operator',
'little',
'appear',
'ms',
'actually',
'perform',
'monthly',
'opposite',
'latex',
'job',
'forum',
'correct',
'install',
'miss',
'local',
'remain',
'chain',
'music',
'ready',
'hundr',
'dinner',
'bill',
'singapore',
'option',
'multimedium',
'dialect',
'translation',
'most',
'different',
'literature',
'unite',
'sit',
'sun',
'desirous',
'bear',
'scientist',
'income',
'urge',
'life',
'extensive',
'label',
'city',
'july',
'mouse',
'win',
'continent',
'die',
'tom',
'diploma',
'edit',
'fulfill',
'sequence',
'lucky',
'less',
'manufacturer',
'implication',
'colingacl',
'global',
'referral',
'western',
'using',
'map',
'great',
'response',
'wrong',
'bind',
'analyse',
'busy',
'pb',
'commerce',
'ext',
'polish',
'worldwide',
'previously',
'peter',
'studies',
'appropriate',
'mailbox',
'again',
'partner',
'truly',
'catch',
'develop',
'meeting',
'title',
'cfp',
'parttime',
'begin',
'dozen',
'addition',
'artificial',
'experiment',
'mci',
'around',
'alexis',
'april',
'dictionary',
'receive',
'internet',
'exercise',
'edinburgh',
'oversea',
'paper',
'charge',
'j',
'store',
'nice',
'raleigh',
'six',
'style',
'www',
'member',
'activity',
'y',
'onetime',
'comment',
'status',
'means',
'reap',
'chat',
'utility',
'usage',
'beautiful',
'extractor',
'hotmail',
'phrase',
'doctor',
'generally',
'highly',
'helpful',
'access',
'red',
'ac',
'usa',
'structural',
'likewise',
'major',
'wherea',
'acl',
'canadian',
'cds',
'mastercard',
'programme',
'living',
'ps',
'goe',
'design',
'end',
'foot',
'postscript',
'empirical',
'color',
'corner',
'unsubscribe',
'change',
'deal',
'substantial',
'planet',
'michigan',
'dollars',
'simon',
'trip',
'award',
'credit',
'distinguish',
'quantifier',
'registration',
'grateful',
'doe',
'hesitate',
'psycholinguistic',
'robert',
'publication',
'typical',
'demo',
'log',
'e',
'interest',
'm',
'tremendous',
'simple',
'phonological',
'excellent',
'enjoy',
'genie',
'preview',
'december',
'impossible',
'america',
'webmaster',
'dori',
'le',
'de',
'please',
'reality',
'heart',
'weeks',
'parse',
'meet',
'russian',
'dear',
'netherland',
'speak',
'floor',
'blvd',
'entirely',
'clearance',
'practical',
'contents',
'medium',
'lay',
'somewhat',
'surely',
'accompany',
'buy',
'judgment',
'orders',
'fairchild',
'zero',
'focus',
'spout',
'wish',
'translate',
'vendor',
'faith',
'sincerely',
'mixe',
'draw',
'typology',
'joan',
'participation',
'quote',
'integrate',
'recently',
'vocabulary',
'interaction',
'kit',
'cycle',
'session',
'august',
'sometime',
'non',
'volumes',
'anderson',
'anna',
'discussion',
'diskette',
'finding',
'entire',
'fl',
'tip',
'player',
'traditional',
'lie',
'opportunity',
'ltd',
'shop',
'front',
'reread',
'alway',
'condition',
'band',
'sex',
'eye',
'proper',
'century',
'avenue',
'buck',
'motivation',
'postfach',
'macintosh',
'ignore',
'inquiry',
'vary',
'md',
'poor',
'move',
'top',
'capture',
'european',
'harri',
'israel',
'modify',
'eventually',
'conversation',
'assessment',
'produce',
'coverage',
'media',
'click',
'indo',
'ma',
'female',
'genuine',
'typological',
'interdisciplinary',
'predicate',
'banner',
'provide',
'back',
'generate',
'independent',
'june',
'monday',
'individual',
'speed',
'dan',
'thing',
'demand',
'wealth',
'value',
'programs',
'texa',
'nt',
'national',
'introduction',
'amazing',
'hr',
'intelligent',
'request',
'surface',
'classified',
'policy',
'mediumsize',
'plain',
'hit',
'im',
'commonly',
'star',
'guy',
'symposium',
'w',
'september',
'forever',
'notify',
'rest',
'repeat',
'martin',
'description',
'undoubtedly',
'myself',
'zip',
'kong',
'happen',
'direct',
'percentage',
'grammar',
'delivery',
'reply',
'du',
'weekend',
'although',
'french',
'rich',
'http',
'stay',
'scheme',
'conceptual',
'deep',
'subject',
'abuse',
'consult',
'below',
'fastest',
'organiser',
'comparable',
'currently',
'influence',
'suppose',
'htm',
'enhance',
'video',
'jone',
'document',
'mb',
'consideration',
'apology',
'iro',
'michael',
'result',
'bid',
'until',
'excess',
'put',
'exclude',
'hi',
'unlimit',
'bring',
'explore',
'trend',
'try',
'numbers',
'expect',
'organise',
'ed',
'history',
'favorite',
'due',
'nc',
'promptly',
'yours',
'san',
'reports',
'documentation',
'initially',
'near',
'birth',
'park',
'trash',
'firm',
'confident',
'virtually',
'syntactic',
'evergrow',
'quit',
'register',
'general',
'stun',
'benefit',
'quality',
'spain',
'teacher',
'research',
'pic',
'participant',
'evaluation',
'responsible',
'perceive',
'institution',
'bernard',
'off',
'cooperation',
'marketing',
'north',
'illustrate',
'hardcore',
'yield',
'newsgroup',
'fantastic',
'science',
'beach',
'innovative',
'treat',
'signal',
'run',
'exactly',
'brand',
'pennsylvanium',
'wait',
'contact',
'minute',
'totally',
'statistics',
'state',
'amateur',
'researcher',
'clean',
'cluster',
'open',
'reconstruction',
'chri',
'perfectly',
'help',
'completely',
'operate',
'loss',
'watch',
'approve',
'someone',
'arise',
'scope',
'development',
'unless',
'version',
'necessarily',
'edition',
'dutch',
'novel',
'trial',
'juno',
'fast',
'millions',
'twenty',
'dramatically',
'anywhere',
'original',
'acceptance',
'downsize',
'today',
'exact',
'weekly',
'keynote',
'forget',
'characteristic',
'txt',
'even',
'increase',
'clear',
'advertiser',
'purchase',
'date',
'integration',
'conversational',
'adult',
'classic',
'plan',
'earth',
'bottom',
'associate',
'sales',
'south',
'comprehensive',
'making',
'transcription',
'easily',
'finger',
'la',
'mortgage',
'add',
'conceal',
'verbal',
'underlie',
'long',
'import',
'analysis',
'tool',
'application',
'perhap',
'snail',
'review',
'easiest',
'extremely',
'though',
'verify',
'leave',
'virtual',
'dynamic',
'recipient',
'couple',
'least',
'germany',
'useful',
'client',
'television',
'prof',
'scott',
'phd',
'importance',
'gb',
'korean',
'order',
'variation',
'aim',
'trust',
'thoma',
'corporations',
'musical',
'much',
'lifetime',
'intrusion',
'set',
'iii',
'potential',
'concept',
'country',
'hotel',
'school',
'master',
'acquire',
'laugh',
'mo',
'psychological',
'club',
'obviously',
'debt',
'extract',
'vision',
'million',
'acquisition',
'object',
'human',
'sake',
'include',
't',
'success',
'theoretical',
'community',
'vacation',
'sample',
'wh',
'five',
'finance',
'search',
'datum',
'engine',
'while',
'everybody',
'february',
'author',
'editor',
'summarize',
'fundamental',
'publish',
'blackwell',
'yourself',
'hello',
'investigation',
'universal',
'karen',
'jump',
'umontreal',
'follow',
'professional',
'emerge',
'once',
'day',
'jame',
'slip',
'medical',
'borrow',
'song',
'idea',
'hour',
'argument',
'obvious',
'listing',
'november',
'subscription',
'shift',
'pleasure',
'london',
'however',
'fun',
'tree',
'man',
'classroom',
'dates',
'mt',
'accommodation',
'cm',
'alternative',
'send',
'diversity',
'framework',
'bag',
'cognition',
'relationship',
'reasonable',
'att',
'exceed',
'von',
'during',
'region',
'accessible',
'linguistics',
'refer',
'observe',
'britain',
'creditor',
'specify',
'moneymake',
'relevance',
'honor',
'connection',
'news',
'parameter',
'twelve',
'cv',
'mike',
'parallel',
'apply',
'is',
'here',
'short',
'password',
'mellon',
'textbook',
'telephone',
'morn',
'paragraph',
'educational',
'worth',
'effective',
'reflect',
'normal',
'compute',
'ba',
'pretty',
'comparative',
'contribute',
'latest',
'minimum',
'interpretation',
'australium',
'drop',
'tell',
'fill',
'richer',
'compile',
'residual',
'convention',
'belgium',
'martha',
'functional',
'variety',
'white',
'spell',
'interview',
'isp',
'cognitive',
'slavic',
'u',
'possibility',
'speech',
'hire',
'spokane',
'japanese',
'education',
'unify',
'distribute',
'perception',
'enquiry',
'president',
'poster',
'african',
'town',
'overload',
'everythe',
'testimonial',
'death',
'christian',
'purpose',
'return',
'team',
'duration',
'cinema',
'lot',
'seem',
'linguistic',
'freedom',
'v',
'want',
'cheap',
'total',
'dept',
'literary',
'second',
'honest',
'wife',
'head',
'agency',
'final',
'assume',
'berlin',
'alone',
'compete',
'three',
'attach',
'montreal',
'movie',
'consist',
're',
'incredible',
'lack',
'mistake',
'satisfy',
'fine',
'middle',
'ask',
'capability',
'guest',
'discover',
'resort',
'amount',
'define',
'certainly',
'problem',
'organize',
'believe',
'resell',
'survey',
'whatsoever',
'lisa',
'cent',
'largest',
'requirement',
'property',
'msn',
'mail',
'loan',
'domain',
'released',
'pour',
'sum',
'complete',
'syntax',
'symbol',
'command',
'sheffield',
'interpret',
'reviewer',
'merciless',
'nijmegen',
'dupe',
'industrial',
'phonology',
'create',
'pittsburgh',
'radio',
'together',
'sender',
'privacy',
'nature',
'length',
'contributor',
'forthcome',
'bet',
'genre',
'product',
'expiration',
'indiana',
'nl',
'obligation',
'newsletter',
'ii',
'emailer',
'exclusive',
'note',
'maintain',
'assure',
'rock',
'quick',
'retail',
'copy',
'x',
'ad',
'light',
'serve',
'toy',
'press',
'file',
'days',
'gold',
'strength',
'few',
'eastern',
'both',
'dependency',
'chomsky',
'course',
'morpheme',
'financially',
'device',
'behind',
'situation',
'parent',
'foundation',
'bit',
'orient',
'stealth',
'remember',
'vium',
'personally',
'difficult',
'money',
'jp',
'institute',
'disc',
'lyric',
'correspondence',
'cancel',
'probably',
'asset',
'prompt',
'equipment',
'feature',
'id',
'conclude',
'need',
'sale',
'truth',
'automatically',
'mix',
'spend',
'sorry',
'index',
'art',
'fact',
'department',
'common',
'estate',
'publisher',
'basically',
'conjunction',
'ram',
'arrange',
'whether',
'example',
'sure',
'plenary',
'love',
'preparation',
'actual',
'cameraready',
'cost',
'point',
'experience',
'quickly',
'thousands',
'ultimate',
'browser',
'started',
'reg',
'coordinate',
'rush',
'network',
'indefinite',
'delete',
'mailer',
'package',
'instead',
'side',
'already',
'essential',
'ever',
'responsibility',
'mit',
'brief',
'desire',
'discovery',
'royal',
'update',
'intelligence',
'control',
'present',
'round',
'sponsor',
'occur',
'philosophy',
'current',
'web',
'text',
'broadcast',
'spot',
'effect',
'tradition',
'tv',
'germanic',
...}
Having computed the most common words, we are now ready to compute the conditional probability that a given word occurs in a spam email.
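One common way to estimate such a conditional probability is the fraction of spam emails that contain the word, combined with add-one (Laplace) smoothing so that no probability is exactly zero. The sketch below is an assumption about how this could be done and not necessarily the method used in this notebook; the function name `word_probability` and the toy counts are hypothetical.

```python
from collections import Counter

def word_probability(word, counter, no_emails):
    """Estimate P(word occurs | class) with add-one smoothing.

    counter[word] is the number of emails of the class that contain word,
    no_emails is the total number of emails of that class.
    """
    return (counter[word] + 1) / (no_emails + 2)

# Toy counts, invented for illustration: 10 spam emails in total.
toy_spam_counter = Counter({'free': 7, 'paper': 0})
print(word_probability('free',  toy_spam_counter, 10))  # (7 + 1) / (10 + 2)
print(word_probability('paper', toy_spam_counter, 10))  # (0 + 1) / (10 + 2)
```

The smoothing term matters for words that never occur in one of the classes: without it, a single such word would force the probability of the whole class to zero.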
The function $\texttt{get_common_words}(\texttt{fn})$ takes a filename $\texttt{fn}$ as its argument. It reads the file and returns the set of all words in `Common_Words` that are found in the given file.
In [11]:
def get_common_words(fn):
    return get_words(fn) & Common_Words
We test this function for a small email.
In [12]:
get_common_words('EmailData/ham-train/3-380msg4.txt')
Out[12]:
{'anyone',
'article',
'berkeley',
'book',
'consonant',
'edu',
'english',
'hard',
'helpful',
'hi',
'interest',
'm',
'modern',
'phonetics',
'project',
'recommend',
'source',
'specifically',
'thank',
'too',
'work'}
The function `count_commmon_words` takes a string specifying a `directory`. It returns a `Counter` that stores, for every word in `Common_Words`, the number of files in `directory` that contain this word.
In [13]:
def count_commmon_words(directory):
    Words = Counter()
    for file_name in os.listdir(directory):
        Words.update(get_common_words(directory + file_name))
    return Words
Next, we compute dictionaries that store, for every common word, the number of training emails that contain this word.
In [14]:
Spam_Counter = count_commmon_words(spam_dir_train)
Spam_Counter
Out[14]:
Counter({'earn': 51,
'experience': 63,
'through': 75,
'phd': 6,
'increase': 39,
'grant': 12,
'effort': 42,
'choice': 23,
'here': 146,
'short': 38,
'field': 33,
'part': 50,
'personal': 67,
'programs': 15,
'base': 42,
'ba': 8,
'phone': 93,
'power': 30,
'necessary': 25,
'degree': 9,
'further': 51,
'detail': 55,
'call': 132,
'advance': 20,
'require': 64,
'award': 13,
'present': 27,
'knowledge': 30,
'money': 140,
'university': 15,
'diploma': 7,
'ma': 13,
'cost': 99,
'entire': 29,
'conference': 6,
'week': 104,
'receive': 157,
'start': 106,
'our': 223,
'delete': 39,
'po': 27,
'old': 40,
'mailer': 15,
'financial': 55,
'member': 54,
'problem': 47,
'believe': 56,
'ago': 29,
'throw': 13,
'customer': 52,
'hello': 36,
'letter': 67,
'inexpensive': 16,
'guarantee': 73,
'ignore': 22,
'complete': 54,
'control': 30,
'outside': 20,
'cash': 69,
'name': 133,
'usa': 31,
'state': 103,
'texa': 5,
'send': 154,
'later': 35,
'without': 66,
'print': 56,
'program': 99,
'best': 123,
'one': 168,
'note': 59,
'free': 198,
'show': 76,
'computer': 69,
'credit': 71,
'registration': 9,
'must': 80,
'process': 54,
'center': 16,
'today': 116,
'weekly': 26,
'mind': 33,
'zip': 52,
'interest': 98,
'compound': 3,
'few': 68,
'address': 166,
'simple': 72,
'telephone': 31,
'educational': 3,
'main': 22,
'worth': 32,
'entitle': 4,
'convert': 8,
'plan': 41,
's': 219,
'message': 106,
'join': 54,
'number': 106,
'respond': 24,
'box': 56,
'achieve': 21,
'card': 70,
'life': 65,
'solution': 13,
'mortgage': 17,
'please': 188,
'city': 69,
'information': 153,
'especially': 23,
'net': 68,
'id': 21,
'participate': 31,
'us': 156,
'independence': 12,
'tuesday': 2,
'enable': 9,
'company': 102,
'over': 146,
'simply': 79,
'night': 18,
'pm': 26,
'intrusion': 15,
'return': 65,
'solid': 14,
'establish': 15,
'mean': 17,
'freedom': 36,
'form': 63,
'begin': 32,
'system': 65,
'debt': 30,
'obtain': 19,
'secure': 25,
'per': 79,
'pack': 13,
'vacation': 31,
'west': 6,
'pay': 92,
'e': 87,
'home': 101,
'accomodation': 1,
'wonderful': 10,
'three': 25,
'buck': 7,
'room': 12,
'mail': 179,
'reserve': 16,
'stay': 21,
'season': 7,
'announce': 13,
'want': 145,
'follow': 118,
'space': 16,
'com': 160,
'compuserve': 17,
'day': 154,
'lunch': 8,
'book': 33,
'doe': 9,
'additional': 51,
'million': 85,
'reach': 43,
'commercial': 19,
'info': 47,
'future': 74,
'success': 60,
'internet': 124,
'software': 67,
'network': 22,
'search': 60,
'permanently': 8,
'area': 43,
'evaluation': 4,
'proper': 7,
'requirement': 13,
'presence': 9,
'section': 31,
'stop': 38,
'regard': 20,
'propose': 9,
'web': 93,
'advantage': 45,
'sender': 23,
'certain': 16,
'help': 89,
'remove': 150,
'target': 23,
'product': 98,
'fellow': 16,
'promote': 24,
'luck': 23,
'basis': 18,
'request': 80,
'comply': 17,
'recent': 7,
'lead': 19,
'mailing': 55,
'bill': 54,
'selection': 9,
'c': 49,
'reply': 89,
'ten': 17,
'paragraph': 9,
'post': 30,
'unite': 31,
'transmission': 8,
'gov': 20,
'http': 157,
'entrepreneur': 10,
'subject': 102,
'tool': 36,
'service': 108,
'dear': 37,
'business': 114,
'assist': 9,
'level': 44,
'need': 145,
'sale': 57,
'thoma': 5,
'item': 11,
'much': 102,
'try': 75,
'set': 43,
'wish': 85,
'thank': 89,
'market': 102,
'email': 185,
'online': 85,
'federal': 27,
'audience': 7,
'park': 16,
'check': 126,
'greatest': 27,
're': 104,
'onto': 6,
'release': 30,
'include': 129,
'player': 10,
'visit': 80,
'ultimate': 10,
'shop': 26,
'while': 46,
'chart': 8,
'cd': 41,
'never': 79,
'package': 48,
'alway': 57,
'www': 110,
'band': 8,
'why': 65,
'event': 15,
'full': 65,
'digital': 16,
'right': 96,
'delay': 9,
'yourself': 72,
'late': 13,
'friend': 56,
'easy': 89,
'available': 93,
'beautiful': 20,
'chance': 43,
'fantastic': 27,
'top': 50,
'pick': 32,
'run': 48,
'access': 45,
'john': 14,
'competition': 25,
'click': 100,
'offer': 143,
'pop': 14,
'n': 27,
'roll': 18,
'big': 46,
'sound': 29,
'radio': 17,
'down': 67,
'play': 29,
'provide': 71,
'london': 11,
'thing': 57,
'fun': 44,
'site': 121,
'record': 32,
'prepare': 22,
'nt': 127,
'true': 44,
'unsubscribe': 20,
'b': 40,
'technology': 29,
'miss': 42,
'exclusive': 23,
'capitalfm': 17,
'hit': 42,
'thursday': 3,
'save': 81,
'straight': 6,
'choose': 52,
'question': 76,
'rock': 7,
'star': 16,
'music': 14,
'europe': 12,
'hesitate': 25,
'graphic': 12,
'storm': 7,
'instant': 15,
'supply': 13,
'special': 85,
'll': 107,
'regular': 16,
'hurry': 8,
'many': 117,
'even': 99,
'reveal': 12,
'too': 46,
'attend': 7,
've': 81,
'website': 39,
'list': 166,
'classic': 3,
'absolutely': 39,
'south': 15,
'enter': 50,
'latest': 36,
'pc': 18,
'prize': 16,
'label': 8,
'tell': 71,
'fill': 55,
'win': 87,
'paradise': 7,
'stock': 23,
'thompson': 5,
'discount': 18,
'couple': 21,
'guess': 8,
'u': 48,
'entirely': 4,
'amaze': 37,
'link': 53,
'advertisement': 42,
'better': 57,
'william': 6,
'feel': 33,
'become': 42,
'manufacturer': 9,
'album': 12,
'game': 28,
'still': 52,
'buy': 79,
'primary': 8,
'bring': 35,
'screen': 12,
'president': 7,
'biz': 9,
'surround': 5,
'poster': 1,
'everythe': 15,
'fm': 10,
'focus': 5,
'talk': 29,
'team': 20,
'mailbox': 24,
'next': 69,
'catch': 16,
'favourite': 9,
'world': 80,
'laugh': 12,
'whether': 20,
'performance': 8,
'hot': 27,
'head': 12,
'capital': 43,
'movie': 19,
'major': 57,
'submission': 11,
'automatically': 31,
'report': 70,
'notice': 7,
'engine': 34,
'advertiser': 14,
'within': 89,
'bulk': 58,
'after': 80,
'each': 96,
'etc': 53,
'every': 114,
'appropriate': 5,
'page': 57,
'toll': 36,
'monthly': 28,
'pro': 10,
'hr': 15,
'extractor': 11,
'block': 17,
'month': 94,
'review': 23,
'submit': 19,
'media': 3,
'tag': 3,
'thousands': 10,
'solve': 4,
'something': 32,
'spam': 26,
'reg': 11,
'dollar': 73,
'powerful': 27,
'quickly': 29,
'community': 7,
't': 73,
'high': 47,
'literally': 9,
'general': 13,
'along': 30,
'travel': 31,
'ask': 67,
'benefit': 27,
'oversea': 7,
'paper': 34,
'finance': 14,
'promise': 23,
'legally': 10,
'amount': 61,
'clearly': 7,
'confirm': 9,
'certainly': 11,
'upon': 23,
'contract': 10,
'word': 38,
'extra': 42,
'thousand': 71,
'means': 18,
'work': 123,
'soon': 50,
'monitor': 8,
'before': 89,
'themselve': 21,
'vary': 8,
'march': 10,
'move': 38,
'ca': 43,
'under': 55,
'exactly': 38,
'kid': 14,
'public': 21,
'view': 17,
'greatly': 9,
'earlier': 4,
'contact': 67,
'likewise': 4,
'currency': 10,
'minute': 41,
'wall': 13,
'create': 56,
'daily': 22,
'yet': 20,
'reason': 42,
'effect': 5,
'editorial': 3,
'santa': 4,
'optional': 13,
'back': 85,
'completely': 36,
'end': 47,
'individual': 25,
'operate': 22,
'organization': 23,
'however': 27,
'watch': 36,
'someone': 48,
'rate': 61,
'wealth': 10,
'fortune': 22,
'own': 96,
'political': 3,
'membership': 19,
'corner': 8,
'national': 13,
'change': 59,
'payable': 35,
'dollars': 19,
'write': 55,
'o': 42,
'overnight': 25,
'let': 65,
'group': 34,
'first': 110,
'assure': 9,
'profile': 10,
'same': 69,
'attention': 22,
'publication': 12,
'continue': 33,
'postage': 18,
'else': 51,
'gold': 20,
'instruction': 64,
'nor': 23,
'm': 78,
'fee': 36,
'most': 112,
'date': 53,
'different': 62,
'announcement': 6,
'concern': 12,
'glad': 7,
'unlike': 10,
'earth': 21,
'able': 46,
'parent': 8,
'easily': 40,
'anyone': 61,
'add': 79,
'york': 18,
'depend': 15,
'long': 39,
'below': 99,
'allow': 60,
'action': 37,
'street': 37,
'operation': 10,
'exist': 23,
'legal': 42,
'advice': 6,
'extremely': 25,
'leave': 57,
'cancel': 8,
'important': 45,
'californium': 9,
'refund': 29,
'american': 35,
'document': 11,
'confidential': 13,
'hand': 45,
'read': 85,
'conclude': 12,
'keep': 80,
'grow': 22,
'until': 44,
'surely': 16,
'hi': 24,
'secret': 38,
'global': 17,
'unlimit': 20,
'profit': 48,
'enquiry': 1,
'don': 42,
'great': 81,
'line': 93,
'learn': 53,
'ship': 42,
'immediately': 51,
'those': 77,
'instruct': 15,
'limited': 17,
'ourselve': 9,
'worldwide': 26,
'purpose': 13,
'source': 21,
'plus': 60,
'again': 78,
'office': 48,
'school': 16,
'low': 33,
'hundred': 48,
'total': 47,
'd': 54,
'recently': 21,
'second': 23,
'suite': 36,
'exchange': 19,
'share': 42,
'method': 39,
'extract': 15,
'around': 33,
'tip': 19,
'listen': 13,
'excite': 42,
'teacher': 1,
'everybody': 7,
'beer': 5,
'answer': 50,
'century': 10,
'ever': 78,
'love': 38,
'chat': 16,
'universal': 3,
'channel': 9,
'globe': 5,
'zone': 8,
'hottest': 18,
'uk': 21,
'red': 8,
'song': 6,
'wait': 51,
'past': 42,
'tv': 18,
'compzone': 7,
'june': 10,
'man': 17,
'forthcome': 4,
'rd': 20,
'piece': 32,
'break': 45,
'bag': 7,
'stress': 6,
'fabulous': 17,
'live': 69,
'highlight': 6,
'ad': 49,
'martin': 3,
'angele': 7,
'beverage': 6,
'st': 38,
'saturday': 13,
'co': 18,
'delivery': 33,
'summer': 8,
'both': 45,
'weekend': 12,
'th': 45,
'where': 90,
'la': 8,
'professor': 3,
'holiday': 17,
'meet': 31,
'lyric': 5,
'video': 39,
'size': 22,
'nd': 26,
'bargain': 12,
'vote': 9,
'prof': 2,
'ticket': 24,
'feature': 22,
'prior': 23,
'carefully': 20,
'really': 65,
'order': 130,
'film': 10,
'musical': 5,
'wednesday': 4,
'expensive': 17,
'boy': 24,
'winner': 16,
'cinema': 9,
'lot': 52,
'title': 33,
'lo': 7,
'everyone': 48,
'kit': 10,
'mark': 14,
'character': 3,
'price': 84,
'preparation': 4,
'girl': 20,
'xxx': 21,
'teen': 15,
'trial': 23,
'index': 19,
'adult': 42,
'html': 42,
'mci': 12,
'z': 5,
'range': 14,
'familiar': 7,
'several': 49,
'blank': 12,
'numerous': 10,
'duplicate': 22,
'circle': 6,
'finish': 12,
'opportunity': 73,
'extension': 6,
'international': 38,
'user': 22,
'contain': 43,
'possible': 40,
'broad': 1,
'bank': 51,
'fund': 18,
'almost': 32,
'off': 70,
'ours': 12,
'canada': 12,
'cause': 10,
'released': 10,
'risk': 36,
'newsgroup': 10,
'cleanest': 11,
'vulgarity': 7,
'cut': 26,
'mine': 13,
'highly': 19,
'sort': 17,
'fax': 51,
'filter': 19,
'produce': 33,
'wrap': 15,
'dure': 3,
'download': 30,
'dupe': 8,
'undeliverable': 18,
'sell': 84,
'finally': 34,
'unique': 27,
'real': 52,
'anon': 6,
'nobody': 9,
'private': 23,
'generate': 43,
'mil': 9,
'bonus': 41,
'enclose': 28,
'type': 84,
'key': 20,
'actually': 24,
'unless': 18,
'fast': 27,
'place': 81,
'yes': 30,
'remain': 11,
'valid': 14,
'close': 22,
'specifically': 6,
'since': 37,
'w': 18,
'file': 55,
'huge': 37,
'is': 83,
'tremendous': 11,
'small': 41,
'password': 8,
'purchase': 73,
'are': 56,
'against': 33,
'anything': 43,
'course': 40,
'edu': 9,
'average': 18,
'directory': 23,
'eliminate': 25,
'replace': 16,
'super': 24,
'production': 10,
'bottom': 23,
'clock': 7,
'server': 27,
'account': 46,
'rich': 24,
'gather': 7,
'webmaster': 12,
'marketer': 14,
'envelope': 27,
'postmaster': 6,
'abuse': 10,
'stealth': 22,
'whole': 24,
'inside': 15,
'ensure': 7,
'org': 14,
'vium': 45,
'faster': 23,
'removal': 9,
'investment': 36,
'longer': 23,
'classify': 10,
'cdrom': 12,
'pure': 12,
'isp': 17,
'road': 25,
'less': 57,
'client': 18,
'result': 56,
'bid': 18,
'excess': 18,
'put': 77,
'reduce': 25,
'fresh': 42,
'otherwise': 14,
'using': 27,
'response': 44,
'combine': 14,
'fact': 43,
'tout': 6,
'addresses': 19,
'numbers': 10,
'collect': 24,
'country': 41,
'due': 26,
'seem': 18,
'flame': 10,
'prodigy': 7,
'sign': 41,
'dozen': 5,
'test': 35,
'example': 35,
'near': 15,
'sure': 65,
'lists': 20,
'consist': 3,
'actual': 9,
'diskette': 8,
'fine': 10,
'act': 21,
'doubt': 27,
'magazine': 18,
'window': 33,
'compress': 7,
'stimulate': 4,
'activity': 12,
'whatsoever': 13,
'comment': 11,
'position': 35,
'multiple': 16,
'macintosh': 5,
'utility': 12,
'everything': 50,
'meg': 7,
'treat': 22,
'intelligence': 15,
'command': 6,
'once': 68,
'conversation': 6,
'compatible': 6,
'disk': 13,
'girlfriend': 4,
'design': 34,
'rom': 11,
'above': 62,
'mac': 9,
'differently': 4,
'woman': 15,
'king': 6,
'protection': 13,
'install': 9,
'celebrity': 7,
'correct': 14,
'copy': 64,
'guy': 15,
'code': 58,
'personality': 4,
'x': 46,
'either': 33,
'toy': 9,
'existence': 6,
'voice': 14,
'likes': 6,
'hear': 42,
'hard': 44,
'unmark': 6,
'ibm': 7,
'boyfriend': 5,
'sexual': 12,
'reality': 8,
'turn': 41,
'model': 8,
'remember': 45,
'deat': 28,
'higher': 15,
'continent': 5,
'interactive': 11,
'realistic': 9,
'guide': 29,
'drive': 23,
'relate': 17,
'virtual': 10,
'blvd': 12,
'least': 44,
'upset': 7,
'obey': 4,
'beg': 5,
'sexually': 8,
'attitude': 5,
'ram': 7,
'inform': 13,
'partner': 27,
'v': 28,
'blast': 7,
'club': 18,
'artificial': 6,
'clothe': 10,
'imagine': 33,
'porn': 8,
'handle': 26,
'sex': 22,
'story': 20,
'picture': 16,
'birth': 6,
'none': 6,
'rejection': 1,
'charge': 48,
'responsible': 10,
'north': 7,
'qualify': 23,
'law': 33,
'perform': 11,
'job': 48,
'annual': 12,
'conduct': 4,
'creditor': 15,
'bankruptcy': 21,
'regardless': 10,
'match': 9,
'apply': 18,
'bad': 12,
'excellent': 24,
'income': 69,
'payment': 38,
'express': 28,
'application': 15,
'seek': 13,
'security': 41,
'made': 9,
'nj': 8,
'student': 13,
'prompt': 16,
'deposit': 21,
'resource': 22,
'history': 13,
'guaranteed': 21,
'signature': 29,
'savings': 12,
'final': 13,
'datum': 9,
'recieve': 11,
'text': 31,
'clean': 19,
'open': 45,
'together': 20,
'cheque': 5,
'value': 30,
'unsolicit': 14,
'england': 6,
'clear': 15,
'direct': 36,
'minimum': 10,
'import': 8,
'disc': 5,
'recipient': 13,
'fully': 22,
'quote': 10,
'pound': 7,
'normally': 8,
'resident': 11,
'virtually': 14,
'collection': 15,
'select': 48,
'resell': 22,
'cent': 19,
'msn': 9,
'marketing': 29,
'ability': 23,
'management': 12,
'compare': 11,
'class': 26,
'hour': 93,
'mastercard': 35,
'nothing': 48,
'copyright': 17,
'speed': 20,
'accept': 50,
'tree': 7,
'mass': 16,
'expiration': 21,
'deal': 34,
'visa': 40,
'anywhere': 49,
'aol': 39,
'ready': 40,
'dream': 46,
'reward': 10,
'smith': 4,
'sales': 22,
'person': 47,
'function': 6,
'step': 55,
'setup': 9,
'currently': 24,
'hours': 14,
'stepby': 15,
'tax': 28,
'touch': 10,
'thesis': 2,
'kind': 25,
'yours': 54,
'provider': 17,
'rights': 22,
'volume': 15,
'trash': 13,
'satisfy': 19,
'period': 23,
'thereafter': 9,
'sample': 18,
'separate': 11,
'quality': 25,
'services': 15,
...})
In [15]:
Ham__Counter = count_commmon_words(ham__dir_train)
Ham__Counter
Out[15]:
Counter({'range': 29,
'comprise': 4,
'through': 33,
'future': 20,
'lab': 9,
'practice': 11,
'coordinate': 7,
'language': 241,
'international': 76,
'research': 116,
'promise': 5,
'area': 72,
'broad': 10,
'www': 116,
'fund': 12,
'identify': 30,
'pari': 15,
'canada': 28,
'work': 99,
'sunday': 9,
'call': 119,
'umontreal': 7,
'follow': 130,
'assess': 9,
'therefore': 16,
'syntax': 65,
'israel': 8,
'modify': 8,
'present': 79,
'ca': 38,
'outside': 10,
'tag': 7,
'view': 33,
'usa': 53,
'current': 32,
'state': 57,
'researcher': 40,
'face': 13,
'together': 43,
'programme': 35,
'morphology': 36,
'provide': 86,
'html': 56,
'examine': 16,
'individual': 27,
'accept': 47,
'arabic': 10,
'own': 32,
'target': 8,
'mt': 7,
'pre': 9,
'little': 13,
'computational': 38,
'national': 30,
'forum': 27,
'coordinator': 9,
'specifically': 8,
'bell': 9,
'europe': 20,
'registration': 52,
'bar': 4,
'france': 21,
'either': 40,
'description': 39,
'c': 79,
'mike': 7,
'direct': 27,
'committee': 50,
'short': 25,
'consequence': 15,
'workshop': 71,
'hebrew': 8,
'date': 35,
'concern': 36,
'theme': 27,
'although': 30,
'edu': 105,
'centre': 29,
'xerox': 12,
'http': 137,
'support': 37,
'subject': 44,
'generation': 19,
'where': 57,
'exist': 25,
'university': 201,
'parse': 14,
'papers': 99,
'possibility': 18,
'iro': 7,
'michael': 32,
'result': 41,
'colingacl': 7,
'aim': 41,
'approach': 71,
'art': 24,
'much': 37,
'each': 53,
'common': 23,
'george': 15,
'potential': 16,
'collect': 8,
'develop': 32,
'body': 8,
'august': 42,
'session': 47,
'final': 33,
'montreal': 14,
'challenge': 17,
'contact': 89,
'process': 69,
're': 52,
'homepage': 13,
'susan': 14,
'text': 77,
'web': 65,
'robert': 28,
'benjamin': 19,
'connection': 11,
'speech': 57,
'read': 32,
'visit': 28,
'editorial': 8,
'william': 18,
'chri': 2,
'order': 73,
'h': 34,
'm': 79,
'l': 51,
'j': 44,
'et': 17,
'grammar': 52,
'relation': 26,
'site': 47,
'development': 56,
'bank': 10,
'pattern': 22,
'resource': 24,
'christian': 8,
'word': 114,
'ed': 31,
'g': 61,
'nl': 42,
'function': 25,
'locate': 9,
'linguistic': 170,
'mean': 43,
'semantics': 53,
'english': 125,
'life': 12,
'le': 21,
'sign': 17,
'de': 87,
'social': 30,
'paul': 28,
'please': 133,
'verbal': 9,
'total': 9,
'note': 47,
'harri': 9,
'lexical': 44,
'elizabeth': 9,
'matter': 14,
'verb': 39,
'john': 60,
'linguistics': 103,
'theory': 71,
'natural': 52,
'philosophy': 14,
'ac': 54,
'information': 174,
'k': 35,
'thompson': 4,
'dynamic': 12,
'industry': 3,
'million': 5,
'experience': 33,
'include': 130,
'conference': 99,
'expense': 10,
'implementation': 10,
'effort': 15,
'benefit': 4,
'software': 23,
'database': 15,
'window': 3,
'strong': 9,
'theart': 3,
'candidate': 11,
'position': 43,
'fax': 130,
'science': 61,
'phonetics': 18,
'complete': 30,
'signal': 9,
'prefer': 19,
'n': 28,
'phonology': 45,
'prosodic': 11,
'jean': 13,
'advantage': 8,
'two': 79,
'design': 23,
'length': 27,
'enclose': 5,
'between': 79,
'mac': 6,
'send': 109,
'statistical': 18,
'break': 24,
'substantial': 12,
'job': 21,
'inc': 18,
'skill': 10,
'scientific': 17,
'house': 15,
'knowledge': 23,
'engineer': 15,
'computer': 44,
'salary': 7,
'x': 43,
'graphic': 8,
'center': 31,
'acoustic': 6,
'singapore': 10,
'publication': 53,
'tel': 61,
'e': 131,
'successful': 19,
'apply': 50,
'mr': 11,
'both': 83,
'personal': 11,
'telephone': 33,
'post': 49,
's': 189,
'join': 13,
'project': 33,
'sun': 8,
'number': 80,
'scientist': 10,
'desirable': 5,
'model': 45,
'analysis': 70,
'tool': 22,
'institute': 52,
'relevant': 24,
'californium': 32,
'technical': 24,
'least': 27,
'less': 13,
'phd': 9,
'us': 72,
'need': 54,
'encourage': 23,
'preferably': 20,
'degree': 18,
'stateof': 3,
'email': 136,
'require': 34,
'interaction': 31,
'system': 59,
'chinese': 16,
'content': 33,
'end': 31,
'official': 11,
'fuer': 12,
'later': 25,
'application': 52,
'yet': 15,
'begin': 21,
'mid': 2,
'inform': 8,
'period': 10,
'six': 13,
'sincerely': 4,
'keep': 8,
'january': 26,
'week': 22,
'sprachwissenschaft': 9,
'expect': 18,
'r': 40,
'student': 65,
'cognitive': 43,
'issue': 77,
'october': 26,
'oxford': 13,
'press': 21,
'upto': 5,
'paper': 100,
'f': 30,
'most': 52,
'pp': 38,
'learn': 34,
'key': 13,
'study': 85,
'wide': 33,
'history': 23,
'concept': 20,
'introduction': 24,
'brief': 20,
'title': 61,
'overview': 12,
'org': 18,
'second': 60,
'accessible': 11,
'cloth': 16,
'first': 100,
'cover': 32,
'book': 79,
'third': 19,
'act': 13,
'secretary': 7,
'theoretical': 41,
'patrick': 10,
'general': 61,
'ask': 34,
'p': 68,
'po': 17,
'representation': 35,
'package': 4,
'author': 72,
'dialogue': 18,
'publish': 57,
'page': 89,
'available': 85,
'preliminary': 8,
'jame': 18,
'interface': 23,
'jan': 18,
'formal': 30,
'prove': 6,
'november': 19,
'inference': 11,
'postscript': 20,
'dates': 13,
'prepare': 14,
'version': 24,
'further': 63,
'b': 44,
'latex': 11,
'notification': 34,
'place': 54,
'o': 39,
'limit': 44,
'van': 26,
'steve': 7,
'original': 28,
'tilburg': 11,
'acceptance': 32,
'proceedings': 20,
'host': 15,
'september': 31,
'involve': 28,
'selection': 15,
'aspect': 54,
'interest': 112,
'invite': 74,
'chair': 29,
'initial': 18,
'box': 39,
'room': 21,
'interpretation': 23,
'professor': 25,
'htm': 11,
'martha': 7,
'netherland': 30,
'important': 44,
'submission': 64,
'bring': 42,
'index': 22,
'topics': 10,
'semantic': 51,
'phone': 54,
'department': 74,
'focus': 55,
'context': 40,
'due': 35,
'office': 19,
'anne': 17,
'guideline': 13,
'form': 83,
'mark': 36,
'topic': 74,
'faculty': 20,
'submit': 63,
'preparation': 6,
'technique': 16,
'lead': 31,
'discussion': 81,
'cluster': 9,
'parameter': 10,
'real': 12,
'recognition': 23,
'linguist': 71,
'principle': 29,
'help': 41,
'background': 19,
'andrew': 14,
'decision': 12,
'algorithm': 9,
'datum': 47,
'our': 44,
'enable': 8,
'hide': 3,
'tree': 3,
'statement': 21,
'affiliation': 47,
'clearly': 19,
'select': 30,
'editor': 32,
'suitable': 11,
'reflect': 20,
'list': 75,
'distribution': 14,
'message': 36,
'criterion': 13,
'set': 38,
'goal': 14,
'goodness': 2,
'mit': 24,
'series': 19,
'foundation': 14,
'valuable': 6,
'request': 32,
'communication': 46,
'maximum': 17,
'below': 52,
'underlie': 12,
'isbn': 18,
'method': 34,
'show': 44,
'review': 44,
'reader': 15,
'decade': 8,
'reviewer': 9,
'document': 21,
'abstracts': 9,
'choice': 10,
'website': 19,
'address': 113,
'field': 59,
'style': 27,
'educational': 9,
'organizer': 26,
'discourse': 51,
'market': 15,
'december': 27,
'methodology': 17,
'contribution': 24,
'structure': 59,
'case': 57,
'variable': 7,
'psychology': 18,
'documentation': 5,
'affect': 8,
'informative': 4,
'corpus': 29,
'audience': 8,
'deadline': 55,
'april': 42,
'italian': 15,
'ignore': 12,
'interpret': 14,
'brian': 11,
'texa': 19,
'operator': 3,
'category': 23,
'ii': 27,
'url': 23,
'china': 6,
'must': 58,
'austin': 10,
'cv': 12,
'translation': 35,
'directory': 6,
'electronic': 29,
'amsterdam': 20,
'long': 24,
'il': 14,
'meet': 40,
'u': 41,
'david': 34,
'un': 11,
'simply': 16,
'iii': 15,
'translate': 9,
'world': 48,
'school': 39,
'v': 45,
'd': 94,
'approximately': 13,
'price': 23,
'ad': 6,
'idea': 24,
'recent': 37,
'thanks': 7,
'commercial': 9,
'equipment': 8,
'file': 21,
'even': 36,
'middle': 8,
'along': 22,
'walk': 6,
'opportunity': 15,
'someone': 12,
'eastern': 11,
'expression': 17,
'man': 13,
'wonder': 9,
'different': 56,
'old': 17,
'confirm': 9,
'instead': 8,
'glad': 3,
'those': 58,
'french': 38,
'talk': 34,
'mass': 7,
'sit': 1,
'foreign': 14,
'thank': 38,
'print': 22,
'ibm': 4,
'lose': 10,
'country': 26,
'discuss': 38,
'respond': 8,
'mary': 8,
'write': 68,
'anyone': 37,
'want': 25,
'one': 125,
'input': 8,
'service': 18,
'question': 92,
'assume': 19,
'though': 18,
'grateful': 7,
'early': 21,
'three': 48,
'speak': 44,
'attention': 26,
'probably': 10,
'bite': 5,
'demonstrate': 12,
'useful': 14,
'net': 7,
'option': 7,
'feature': 42,
'keyword': 9,
'sale': 2,
'else': 15,
'still': 21,
'search': 6,
'engine': 6,
'teacher': 20,
'query': 25,
'conclusion': 13,
'spanish': 20,
'similar': 16,
'response': 18,
'quite': 21,
'try': 16,
'au': 13,
'etc': 57,
'fact': 26,
'nt': 34,
'excellent': 4,
'digital': 5,
'colleague': 16,
'true': 13,
'polish': 10,
'return': 14,
'puzzle': 7,
'seem': 33,
'next': 15,
'waste': 2,
'door': 1,
'decide': 8,
'again': 15,
'favourite': 1,
'indeed': 17,
'yes': 5,
'tell': 22,
'mine': 10,
'fail': 6,
'turn': 12,
'com': 33,
'build': 27,
'surprise': 9,
'perhap': 22,
'returns': 2,
'large': 20,
'head': 24,
'experiment': 10,
'extremely': 8,
'offer': 34,
'relate': 57,
'notion': 21,
'far': 19,
'guy': 7,
'size': 10,
'volume': 42,
'vol': 14,
'syntactic': 30,
'll': 6,
'june': 31,
'edinburgh': 10,
'quality': 18,
'assistant': 9,
'vowel': 12,
'dutch': 17,
'king': 10,
'stress': 6,
'uk': 46,
'ling': 20,
'additional': 27,
'point': 44,
'intend': 27,
'receive': 53,
'register': 24,
'home': 24,
'separate': 18,
'ascii': 13,
'february': 27,
'inch': 5,
'anybody': 11,
'inquiry': 13,
'march': 38,
'name': 84,
'minute': 37,
'compare': 14,
'speaker': 75,
'type': 56,
'margin': 8,
'charle': 16,
'announce': 37,
'during': 19,
'copy': 64,
'perspective': 41,
'seminar': 10,
'universitaet': 13,
'st': 28,
'anonymous': 18,
'announcement': 26,
'th': 67,
'abstract': 81,
'card': 17,
'vium': 41,
'format': 32,
'pure': 3,
'correspondence': 13,
'nd': 17,
'reference': 72,
'slavic': 10,
'germany': 45,
'negation': 9,
'participate': 16,
'speakers': 10,
'accompany': 10,
'maria': 14,
'acceptable': 7,
'publisher': 14,
'arrange': 5,
'standard': 30,
'detail': 47,
'participation': 28,
'famous': 7,
'several': 32,
'lie': 7,
'rejection': 8,
'onepage': 13,
'teach': 35,
'late': 12,
'before': 49,
'day': 30,
'run': 11,
'under': 28,
'campus': 15,
'comparison': 18,
'compatible': 2,
'structural': 15,
'major': 35,
'cultural': 13,
'santa': 8,
'west': 12,
'nature': 22,
'organization': 15,
'scope': 15,
'part': 48,
'foot': 3,
'understand': 37,
'society': 34,
'cognition': 13,
'basis': 28,
'relationship': 17,
'program': 67,
'mexico': 14,
'notify': 14,
'special': 37,
'characteristic': 9,
'lecture': 17,
'hardcopy': 13,
'emphasis': 6,
'summer': 16,
'within': 45,
'course': 41,
'crosslinguistic': 8,
'plan': 22,
'enjoy': 5,
'america': 25,
'conceptual': 11,
'july': 32,
'city': 19,
'functional': 25,
'realistic': 2,
'american': 29,
'consideration': 17,
'over': 32,
'four': 22,
'night': 9,
'pragmatic': 39,
'native': 24,
'joan': 9,
'psychological': 9,
'addition': 24,
'direction': 12,
'share': 24,
'der': 14,
'germanic': 9,
'modal': 7,
'romance': 9,
'belgium': 7,
'around': 16,
'fine': 6,
'traditional': 16,
'previous': 12,
'pay': 21,
'attract': 8,
'possible': 58,
'problem': 54,
'organize': 41,
'property': 17,
'soon': 16,
'sum': 11,
'european': 39,
'highly': 8,
'propose': 27,
'interdisciplinary': 14,
'avoid': 8,
'italy': 20,
'logic': 17,
'framework': 25,
'utrecht': 9,
'ph': 23,
'elsewhere': 12,
'let': 22,
'serve': 12,
'thus': 24,
'attend': 14,
'logical': 12,
'fee': 32,
'term': 38,
'main': 31,
'entitle': 5,
'introductory': 8,
'solution': 11,
'advance': 43,
'heart': 4,
'variety': 36,
'become': 27,
'reduce': 8,
'dus': 8,
'explore': 16,
'purpose': 19,
'association': 33,
'per': 27,
'advertise': 2,
'community': 20,
'german': 36,
'und': 9,
'integration': 12,
'im': 10,
'near': 8,
'sheffield': 10,
'die': 12,
'sense': 20,
'college': 29,
'east': 9,
'explanation': 12,
'ltd': 6,
'trade': 2,
'side': 12,
'answer': 21,
'essential': 4,
'mail': 81,
'innovative': 3,
'material': 34,
'sery': 10,
'daily': 10,
'canadian': 9,
'moment': 2,
'forthcome': 7,
'record': 13,
'classroom': 12,
'genre': 9,
'contrast': 15,
'local': 20,
'welcome': 28,
'ave': 4,
'increase': 12,
'gold': 1,
'co': 18,
'textbook': 8,
'every': 12,
'unite': 17,
'dr': 35,
'contribute': 17,
'latest': 8,
'making': 1,
'australium': 6,
'describe': 21,
'whole': 21,
'grammatical': 32,
'especially': 31,
'discount': 1,
'blvd': 2,
'client': 2,
'prof': 20,
'assist': 8,
'level': 37,
'distribute': 10,
'pb': 10,
'peter': 30,
'plus': 17,
'kind': 29,
'master': 7,
'seller': 1,
'beyond': 13,
'mode': 9,
'directly': 25,
'difference': 18,
'proposal': 23,
'pl': 6,
'across': 17,
'charge': 10,
'base': 57,
'se': 9,
'member': 22,
'usage': 17,
'pour': 6,
'organisation': 6,
'tutorial': 13,
'universite': 11,
'hour': 13,
'respect': 13,
'dan': 13,
'forward': 15,
'fr': 14,
'edition': 7,
'relevance': 15,
'parallel': 13,
'du': 9,
'preference': 14,
'marie': 13,
'la': 19,
'organiser': 16,
'consider': 47,
'leave': 18,
'half': 9,
'summary': 35,
'hold': 63,
'mailto': 1,
'en': 10,
'president': 6,
'organise': 12,
'team': 3,
'institut': 13,
'exact': 4,
'dictionary': 20,
'many': 56,
'japanese': 26,
'behalf': 6,
'reply': 12,
'graduate': 30,
'enough': 11,
'after': 40,
'hbe': 6,
'lot': 14,
'former': 5,
'phrase': 26,
'unfortunately': 3,
'integrate': 21,
'japan': 20,
'jp': 14,
'edit': 10,
'dear': 14,
'link': 22,
'here': 40,
'recently': 25,
't': 33,
'doubt': 5,
'while': 33,
'participant': 38,
'alway': 16,
'upon': 12,
'almost': 4,
'off': 7,
'thousand': 2,
'means': 20,
'themselve': 15,
'count': 8,
'access': 22,
'occur': 11,
'sentence': 25,
'indo': 10,
'borrow': 6,
'suggest': 22,
'effect': 30,
'bilingual': 10,
'back': 15,
'necessarily': 10,
'surface': 17,
'lexicon': 22,
'maintain': 6,
'refer': 20,
'w': 36,
'same': 41,
'multilingual': 19,
'dialect': 20,
'voice': 8,
'another': 28,
'morpheme': 10,
'anthropology': 11,
'comparative': 27,
'account': 41,
'noun': 20,
'urge': 4,
'york': 35,
'fill': 13,
'recognize': 5,
'influence': 14,
'easiest': 1,
'boundary': 6,
'bibliography': 15,
'feel': 11,
'article': 32,
'item': 18,
'previously': 14,
'draw': 11,
'component': 8,
'accord': 17,
'hardly': 9,
'whether': 38,
'literary': 9,
'example': 61,
'identical': 11,
'central': 18,
'constituent': 9,
'attach': 7,
'generative': 15,
'subscribe': 7,
'subscription': 7,
'dissertation': 15,
'unpublish': 16,
'notice': 19,
'report': 30,
'line': 21,
'appear': 33,
'newsletter': 3,
'spring': 4,
'max': 14,
'volumes': 5,
'actual': 6,
'al': 10,
'journal': 27,
'onto': 2,
'collection': 12,
'cd': 4,
'user': 16,
'amount': 15,
'institution': 13,
'believe': 15,
'ago': 17,
'full': 41,
'deliver': 3,
'multus': 8,
'move': 18,
'image': 12,
'produce': 15,
'media': 9,
'medical': 5,
'among': 28,
'greater': 6,
'open': 41,
'outstand': 4,
'down': 5,
'play': 16,
'thing': 20,
...})
For every common word $w$ we compute the probability that $w$ occurs in a spam email and the probability that it occurs in a ham email. The formula for spam is:
$$ P(w \in\texttt{Spam}) = \frac{\mbox{number of spam emails containing $w$}}{\mbox{number of all spam emails}} $$
The formula for ham is analogous:
$$ P(w \in\texttt{Ham}) = \frac{\mbox{number of ham emails containing $w$}}{\mbox{number of all ham emails}} $$
However, if we used these formulas, then a common word $w$ that, for some reason, has not yet occurred in any spam email would have a probability of $0$ of occurring in a spam email. Since naive Bayes multiplies the probabilities of the individual words, our classifier would then never classify an email containing the word $w$ as spam. As this cannot be right, we assume that there is one further spam email that contains every common word. This Laplace smoothing assumption changes the formula for $P(w \in\texttt{Spam})$ as follows:
$$ P(w \in\texttt{Spam}) = \frac{\mbox{number of spam emails containing $w$} + 1}{\mbox{number of all spam emails} + 1} $$
The formula for $P(w \in\texttt{Ham})$ is changed in the same way.
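The effect of the smoothing can be checked by hand on a few counts. The following is only a minimal sketch: the helper name `smoothed_probability` is hypothetical and not part of this notebook, and the total of 350 emails is taken from the size of each training directory stated earlier.

```python
def smoothed_probability(containing, total):
    """Estimate P(w in class), pretending there is one extra email that contains w."""
    return (containing + 1) / (total + 1)

# Assumption: 350 training emails in the class, as in this data set.
total_emails = 350

# Without smoothing, a word that was never seen in the class gets probability 0.
unsmoothed = 0 / total_emails

# With smoothing, the same word gets a small positive probability of 1/351.
never_seen = smoothed_probability(0, total_emails)

# A word counted once in the class gets 2/351.
seen_once = smoothed_probability(1, total_emails)

print(unsmoothed, never_seen, seen_once)
```

A word that is never counted in a class therefore ends up with a probability of $1/351 \approx 0.0028$ instead of $0$, and a word that is counted once ends up with $2/351 \approx 0.0057$.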
In [16]:
Spam_Probability = {}
Ham__Probability = {}
for w in Common_Words:
    Spam_Probability[w] = (Spam_Counter[w] + 1) / (no_spam + 1)
    Ham__Probability[w] = (Ham__Counter[w] + 1) / (no_ham + 1)
Spam_Probability
Out[16]:
{'load': 0.037037037037037035,
'comprise': 0.011396011396011397,
'familiar': 0.022792022792022793,
'teen': 0.045584045584045586,
'massive': 0.02564102564102564,
'gamble': 0.04843304843304843,
'none': 0.019943019943019943,
'implementation': 0.002849002849002849,
'majority': 0.05128205128205128,
'cgibin': 0.02564102564102564,
'rejection': 0.005698005698005698,
'smaller': 0.002849002849002849,
'launch': 0.03418803418803419,
'lee': 0.011396011396011397,
'database': 0.07977207977207977,
'food': 0.022792022792022793,
'window': 0.09686609686609686,
'transfer': 0.02564102564102564,
'candidate': 0.017094017094017096,
'delay': 0.02849002849002849,
'frank': 0.03418803418803419,
'multus': 0.05128205128205128,
'late': 0.039886039886039885,
'engage': 0.011396011396011397,
'work': 0.35327635327635326,
'government': 0.039886039886039885,
'newest': 0.03133903133903134,
'call': 0.3789173789173789,
'vulgarity': 0.022792022792022793,
'material': 0.07122507122507123,
'organisation': 0.008547008547008548,
'affect': 0.008547008547008548,
'hundreds': 0.045584045584045586,
'propose': 0.02849002849002849,
'john': 0.042735042735042736,
'campus': 0.002849002849002849,
'competition': 0.07407407407407407,
'view': 0.05128205128205128,
'penny': 0.05128205128205128,
'currency': 0.03133903133903134,
'gender': 0.002849002849002849,
'class': 0.07692307692307693,
'santa': 0.014245014245014245,
'andrew': 0.005698005698005698,
'bonus': 0.11965811965811966,
'refinance': 0.037037037037037035,
'organization': 0.06837606837606838,
'eric': 0.008547008547008548,
'site': 0.3475783475783476,
'ongo': 0.014245014245014245,
'italy': 0.022792022792022793,
'sprachwissenschaft': 0.002849002849002849,
'operator': 0.017094017094017096,
'little': 0.16809116809116809,
'appear': 0.05698005698005698,
'ms': 0.02849002849002849,
'actually': 0.07122507122507123,
'perform': 0.03418803418803419,
'monthly': 0.08262108262108261,
'opposite': 0.008547008547008548,
'latex': 0.005698005698005698,
'job': 0.1396011396011396,
'forum': 0.017094017094017096,
'correct': 0.042735042735042736,
'install': 0.02849002849002849,
'miss': 0.1225071225071225,
'local': 0.07407407407407407,
'remain': 0.03418803418803419,
'chain': 0.042735042735042736,
'music': 0.042735042735042736,
'ready': 0.1168091168091168,
'hundr': 0.02849002849002849,
'dinner': 0.008547008547008548,
'bill': 0.15669515669515668,
'singapore': 0.002849002849002849,
'option': 0.05698005698005698,
'multimedium': 0.008547008547008548,
'dialect': 0.002849002849002849,
'translation': 0.002849002849002849,
'most': 0.32193732193732194,
'different': 0.1794871794871795,
'literature': 0.005698005698005698,
'unite': 0.09116809116809117,
'sit': 0.06552706552706553,
'sun': 0.02564102564102564,
'desirous': 0.03133903133903134,
'bear': 0.017094017094017096,
'scientist': 0.005698005698005698,
'income': 0.19943019943019943,
'urge': 0.019943019943019943,
'life': 0.18803418803418803,
'extensive': 0.02564102564102564,
'label': 0.02564102564102564,
'city': 0.19943019943019943,
'july': 0.019943019943019943,
'mouse': 0.02849002849002849,
'win': 0.25071225071225073,
'continent': 0.017094017094017096,
'die': 0.017094017094017096,
'tom': 0.011396011396011397,
'diploma': 0.022792022792022793,
'edit': 0.022792022792022793,
'fulfill': 0.03418803418803419,
'sequence': 0.05128205128205128,
'lucky': 0.05982905982905983,
'less': 0.16524216524216523,
'manufacturer': 0.02849002849002849,
'implication': 0.002849002849002849,
'colingacl': 0.002849002849002849,
'global': 0.05128205128205128,
'referral': 0.02564102564102564,
'western': 0.011396011396011397,
'using': 0.07977207977207977,
'map': 0.005698005698005698,
'great': 0.2336182336182336,
'response': 0.1282051282051282,
'wrong': 0.039886039886039885,
'bind': 0.017094017094017096,
'analyse': 0.005698005698005698,
'busy': 0.017094017094017096,
'pb': 0.002849002849002849,
'commerce': 0.022792022792022793,
'ext': 0.045584045584045586,
'polish': 0.005698005698005698,
'worldwide': 0.07692307692307693,
'previously': 0.04843304843304843,
'peter': 0.014245014245014245,
'studies': 0.002849002849002849,
'appropriate': 0.017094017094017096,
'mailbox': 0.07122507122507123,
'again': 0.22507122507122507,
'partner': 0.07977207977207977,
'truly': 0.08831908831908832,
'catch': 0.04843304843304843,
'develop': 0.042735042735042736,
'meeting': 0.019943019943019943,
'title': 0.09686609686609686,
'cfp': 0.002849002849002849,
'parttime': 0.02849002849002849,
'begin': 0.09401709401709402,
'dozen': 0.017094017094017096,
'addition': 0.037037037037037035,
'artificial': 0.019943019943019943,
'experiment': 0.008547008547008548,
'mci': 0.037037037037037035,
'around': 0.09686609686609686,
'alexis': 0.002849002849002849,
'april': 0.022792022792022793,
'dictionary': 0.002849002849002849,
'receive': 0.45014245014245013,
'internet': 0.3561253561253561,
'exercise': 0.014245014245014245,
'edinburgh': 0.002849002849002849,
'oversea': 0.022792022792022793,
'paper': 0.09971509971509972,
'charge': 0.1396011396011396,
'j': 0.03418803418803419,
'store': 0.07977207977207977,
'nice': 0.05982905982905983,
'raleigh': 0.037037037037037035,
'six': 0.07977207977207977,
'style': 0.014245014245014245,
'www': 0.3162393162393162,
'member': 0.15669515669515668,
'activity': 0.037037037037037035,
'y': 0.014245014245014245,
'onetime': 0.03133903133903134,
'comment': 0.03418803418803419,
'status': 0.014245014245014245,
'means': 0.05413105413105413,
'reap': 0.03418803418803419,
'chat': 0.04843304843304843,
'utility': 0.037037037037037035,
'usage': 0.011396011396011397,
'beautiful': 0.05982905982905983,
'extractor': 0.03418803418803419,
'hotmail': 0.037037037037037035,
'phrase': 0.011396011396011397,
'doctor': 0.02849002849002849,
'generally': 0.005698005698005698,
'highly': 0.05698005698005698,
'helpful': 0.014245014245014245,
'access': 0.13105413105413105,
'red': 0.02564102564102564,
'ac': 0.019943019943019943,
'usa': 0.09116809116809117,
'structural': 0.002849002849002849,
'likewise': 0.014245014245014245,
'major': 0.16524216524216523,
'wherea': 0.002849002849002849,
'acl': 0.002849002849002849,
'canadian': 0.019943019943019943,
'cds': 0.03133903133903134,
'mastercard': 0.10256410256410256,
'programme': 0.002849002849002849,
'living': 0.02849002849002849,
'ps': 0.02564102564102564,
'goe': 0.019943019943019943,
'design': 0.09971509971509972,
'end': 0.13675213675213677,
'foot': 0.03133903133903134,
'postscript': 0.002849002849002849,
'empirical': 0.002849002849002849,
'color': 0.042735042735042736,
'corner': 0.02564102564102564,
'unsubscribe': 0.05982905982905983,
'change': 0.17094017094017094,
'deal': 0.09971509971509972,
'substantial': 0.05698005698005698,
'planet': 0.02849002849002849,
'michigan': 0.002849002849002849,
'dollars': 0.05698005698005698,
'simon': 0.005698005698005698,
'trip': 0.042735042735042736,
'award': 0.039886039886039885,
'credit': 0.20512820512820512,
'distinguish': 0.005698005698005698,
'quantifier': 0.002849002849002849,
'registration': 0.02849002849002849,
'grateful': 0.005698005698005698,
'doe': 0.02849002849002849,
'hesitate': 0.07407407407407407,
'psycholinguistic': 0.002849002849002849,
'robert': 0.014245014245014245,
'publication': 0.037037037037037035,
'typical': 0.022792022792022793,
'demo': 0.042735042735042736,
'log': 0.03418803418803419,
'e': 0.25071225071225073,
'interest': 0.28205128205128205,
'm': 0.22507122507122507,
'tremendous': 0.03418803418803419,
'simple': 0.20797720797720798,
'phonological': 0.002849002849002849,
'excellent': 0.07122507122507123,
'enjoy': 0.10541310541310542,
'genie': 0.02849002849002849,
'preview': 0.037037037037037035,
'december': 0.037037037037037035,
'impossible': 0.011396011396011397,
'america': 0.08547008547008547,
'webmaster': 0.037037037037037035,
'dori': 0.02849002849002849,
'le': 0.011396011396011397,
'de': 0.014245014245014245,
'please': 0.5384615384615384,
'reality': 0.02564102564102564,
'heart': 0.02849002849002849,
'weeks': 0.03418803418803419,
'parse': 0.002849002849002849,
'meet': 0.09116809116809117,
'russian': 0.005698005698005698,
'dear': 0.10826210826210826,
'netherland': 0.008547008547008548,
'speak': 0.019943019943019943,
'floor': 0.011396011396011397,
'blvd': 0.037037037037037035,
'entirely': 0.014245014245014245,
'clearance': 0.03133903133903134,
'practical': 0.011396011396011397,
'contents': 0.019943019943019943,
'medium': 0.02564102564102564,
'lay': 0.045584045584045586,
'somewhat': 0.019943019943019943,
'surely': 0.04843304843304843,
'accompany': 0.002849002849002849,
'buy': 0.22792022792022792,
'judgment': 0.03133903133903134,
'orders': 0.06837606837606838,
'fairchild': 0.03133903133903134,
'zero': 0.019943019943019943,
'focus': 0.017094017094017096,
'spout': 0.03133903133903134,
'wish': 0.245014245014245,
'translate': 0.008547008547008548,
'vendor': 0.03133903133903134,
'faith': 0.042735042735042736,
'sincerely': 0.10256410256410256,
'mixe': 0.011396011396011397,
'draw': 0.019943019943019943,
'typology': 0.002849002849002849,
'joan': 0.002849002849002849,
'participation': 0.02849002849002849,
'quote': 0.03133903133903134,
'integrate': 0.005698005698005698,
'recently': 0.06267806267806268,
'vocabulary': 0.002849002849002849,
'interaction': 0.002849002849002849,
'kit': 0.03133903133903134,
'cycle': 0.02564102564102564,
'session': 0.011396011396011397,
'august': 0.005698005698005698,
'sometime': 0.05128205128205128,
'non': 0.019943019943019943,
'volumes': 0.017094017094017096,
'anderson': 0.008547008547008548,
'anna': 0.002849002849002849,
'discussion': 0.011396011396011397,
'diskette': 0.02564102564102564,
'finding': 0.014245014245014245,
'entire': 0.08547008547008547,
'fl': 0.037037037037037035,
'tip': 0.05698005698005698,
'player': 0.03133903133903134,
'traditional': 0.05413105413105413,
'lie': 0.017094017094017096,
'opportunity': 0.21082621082621084,
'ltd': 0.008547008547008548,
'shop': 0.07692307692307693,
'front': 0.05698005698005698,
'reread': 0.02849002849002849,
'alway': 0.16524216524216523,
'condition': 0.019943019943019943,
'band': 0.02564102564102564,
'sex': 0.06552706552706553,
'eye': 0.04843304843304843,
'proper': 0.022792022792022793,
'century': 0.03133903133903134,
'avenue': 0.03418803418803419,
'buck': 0.022792022792022793,
'motivation': 0.002849002849002849,
'postfach': 0.002849002849002849,
'macintosh': 0.017094017094017096,
'ignore': 0.06552706552706553,
'inquiry': 0.04843304843304843,
'vary': 0.02564102564102564,
'md': 0.04843304843304843,
'poor': 0.03418803418803419,
'move': 0.1111111111111111,
'top': 0.1452991452991453,
'capture': 0.017094017094017096,
'european': 0.008547008547008548,
'harri': 0.002849002849002849,
'israel': 0.008547008547008548,
'modify': 0.03133903133903134,
'eventually': 0.03418803418803419,
'conversation': 0.019943019943019943,
'assessment': 0.008547008547008548,
'produce': 0.09686609686609686,
'coverage': 0.011396011396011397,
'media': 0.011396011396011397,
'click': 0.28774928774928776,
'indo': 0.002849002849002849,
'ma': 0.039886039886039885,
'female': 0.017094017094017096,
'genuine': 0.022792022792022793,
'typological': 0.002849002849002849,
'interdisciplinary': 0.005698005698005698,
'predicate': 0.002849002849002849,
'banner': 0.02564102564102564,
'provide': 0.20512820512820512,
'back': 0.245014245014245,
'generate': 0.12535612535612536,
'independent': 0.06267806267806268,
'june': 0.03133903133903134,
'monday': 0.02849002849002849,
'individual': 0.07407407407407407,
'speed': 0.05982905982905983,
'dan': 0.005698005698005698,
'thing': 0.16524216524216523,
'demand': 0.02564102564102564,
'wealth': 0.03133903133903134,
'value': 0.08831908831908832,
'programs': 0.045584045584045586,
'texa': 0.017094017094017096,
'nt': 0.3646723646723647,
'national': 0.039886039886039885,
'introduction': 0.014245014245014245,
'amazing': 0.05698005698005698,
'hr': 0.045584045584045586,
'intelligent': 0.005698005698005698,
'request': 0.23076923076923078,
'surface': 0.008547008547008548,
'classified': 0.022792022792022793,
'policy': 0.017094017094017096,
'mediumsize': 0.02564102564102564,
'plain': 0.042735042735042736,
'hit': 0.1225071225071225,
'im': 0.008547008547008548,
'commonly': 0.017094017094017096,
'star': 0.04843304843304843,
'guy': 0.045584045584045586,
'symposium': 0.002849002849002849,
'w': 0.05413105413105413,
'september': 0.017094017094017096,
'forever': 0.05413105413105413,
'notify': 0.02564102564102564,
'rest': 0.08262108262108261,
'repeat': 0.019943019943019943,
'martin': 0.011396011396011397,
'description': 0.014245014245014245,
'undoubtedly': 0.022792022792022793,
'myself': 0.06552706552706553,
'zip': 0.150997150997151,
'kong': 0.005698005698005698,
'happen': 0.10256410256410256,
'direct': 0.10541310541310542,
'percentage': 0.03418803418803419,
'grammar': 0.002849002849002849,
'delivery': 0.09686609686609686,
'reply': 0.2564102564102564,
'du': 0.002849002849002849,
'weekend': 0.037037037037037035,
'although': 0.039886039886039885,
'french': 0.005698005698005698,
'rich': 0.07122507122507123,
'http': 0.45014245014245013,
'stay': 0.06267806267806268,
'scheme': 0.03133903133903134,
'conceptual': 0.005698005698005698,
'deep': 0.02564102564102564,
'subject': 0.2934472934472934,
'abuse': 0.03133903133903134,
'consult': 0.008547008547008548,
'below': 0.2849002849002849,
'fastest': 0.037037037037037035,
'organiser': 0.002849002849002849,
'comparable': 0.011396011396011397,
'currently': 0.07122507122507123,
'influence': 0.005698005698005698,
'suppose': 0.03418803418803419,
'htm': 0.07122507122507123,
'enhance': 0.014245014245014245,
'video': 0.11396011396011396,
'jone': 0.008547008547008548,
'document': 0.03418803418803419,
'mb': 0.02564102564102564,
'consideration': 0.02564102564102564,
'apology': 0.02849002849002849,
'iro': 0.002849002849002849,
'michael': 0.019943019943019943,
'result': 0.1623931623931624,
'bid': 0.05413105413105413,
'until': 0.1282051282051282,
'excess': 0.05413105413105413,
'put': 0.2222222222222222,
'exclude': 0.014245014245014245,
'hi': 0.07122507122507123,
'unlimit': 0.05982905982905983,
'bring': 0.10256410256410256,
'explore': 0.008547008547008548,
'trend': 0.011396011396011397,
'try': 0.21652421652421652,
'numbers': 0.03133903133903134,
'expect': 0.07122507122507123,
'organise': 0.002849002849002849,
'ed': 0.017094017094017096,
'history': 0.039886039886039885,
'favorite': 0.03133903133903134,
'due': 0.07692307692307693,
'nc': 0.042735042735042736,
'promptly': 0.02849002849002849,
'yours': 0.15669515669515668,
'san': 0.022792022792022793,
'reports': 0.06267806267806268,
'documentation': 0.014245014245014245,
'initially': 0.042735042735042736,
'near': 0.045584045584045586,
'birth': 0.019943019943019943,
'park': 0.04843304843304843,
'trash': 0.039886039886039885,
'firm': 0.02564102564102564,
'confident': 0.037037037037037035,
'virtually': 0.042735042735042736,
'syntactic': 0.002849002849002849,
'evergrow': 0.02849002849002849,
'quit': 0.05413105413105413,
'register': 0.07977207977207977,
'general': 0.039886039886039885,
'stun': 0.03133903133903134,
'benefit': 0.07977207977207977,
'quality': 0.07407407407407407,
'spain': 0.002849002849002849,
'teacher': 0.005698005698005698,
'research': 0.07692307692307693,
'pic': 0.017094017094017096,
'participant': 0.039886039886039885,
'evaluation': 0.014245014245014245,
'responsible': 0.03133903133903134,
'perceive': 0.002849002849002849,
'institution': 0.008547008547008548,
'bernard': 0.002849002849002849,
'off': 0.2022792022792023,
'cooperation': 0.005698005698005698,
'marketing': 0.08547008547008547,
'north': 0.022792022792022793,
'illustrate': 0.002849002849002849,
'hardcore': 0.02849002849002849,
'yield': 0.019943019943019943,
'newsgroup': 0.03133903133903134,
'fantastic': 0.07977207977207977,
'science': 0.008547008547008548,
'beach': 0.05982905982905983,
'innovative': 0.014245014245014245,
'treat': 0.06552706552706553,
'signal': 0.011396011396011397,
'run': 0.1396011396011396,
'exactly': 0.1111111111111111,
'brand': 0.04843304843304843,
'pennsylvanium': 0.002849002849002849,
'wait': 0.14814814814814814,
'contact': 0.19373219373219372,
'minute': 0.11965811965811966,
'totally': 0.05982905982905983,
'statistics': 0.042735042735042736,
'state': 0.2962962962962963,
'amateur': 0.042735042735042736,
'researcher': 0.005698005698005698,
'clean': 0.05698005698005698,
'cluster': 0.002849002849002849,
'open': 0.13105413105413105,
'reconstruction': 0.002849002849002849,
'chri': 0.014245014245014245,
'perfectly': 0.05982905982905983,
'help': 0.2564102564102564,
'completely': 0.10541310541310542,
'operate': 0.06552706552706553,
'loss': 0.039886039886039885,
'watch': 0.10541310541310542,
'approve': 0.03133903133903134,
'someone': 0.1396011396011396,
'arise': 0.005698005698005698,
'scope': 0.005698005698005698,
'development': 0.008547008547008548,
'unless': 0.05413105413105413,
'version': 0.045584045584045586,
'necessarily': 0.002849002849002849,
'edition': 0.019943019943019943,
'dutch': 0.002849002849002849,
'novel': 0.005698005698005698,
'trial': 0.06837606837606838,
'juno': 0.03418803418803419,
'fast': 0.07977207977207977,
'millions': 0.045584045584045586,
'twenty': 0.019943019943019943,
'dramatically': 0.03418803418803419,
'anywhere': 0.14245014245014245,
'original': 0.05698005698005698,
'acceptance': 0.017094017094017096,
'downsize': 0.03133903133903134,
'today': 0.3333333333333333,
'exact': 0.05128205128205128,
'weekly': 0.07692307692307693,
'keynote': 0.002849002849002849,
'forget': 0.07977207977207977,
'characteristic': 0.005698005698005698,
'txt': 0.03418803418803419,
'even': 0.2849002849002849,
'increase': 0.11396011396011396,
'clear': 0.045584045584045586,
'advertiser': 0.042735042735042736,
'purchase': 0.21082621082621084,
'date': 0.15384615384615385,
'integration': 0.005698005698005698,
'conversational': 0.002849002849002849,
'adult': 0.1225071225071225,
'classic': 0.011396011396011397,
'plan': 0.11965811965811966,
'earth': 0.06267806267806268,
'bottom': 0.06837606837606838,
'associate': 0.06552706552706553,
'sales': 0.06552706552706553,
'south': 0.045584045584045586,
'comprehensive': 0.019943019943019943,
'making': 0.03418803418803419,
'transcription': 0.002849002849002849,
'easily': 0.1168091168091168,
'finger': 0.03418803418803419,
'la': 0.02564102564102564,
'mortgage': 0.05128205128205128,
'add': 0.22792022792022792,
'conceal': 0.037037037037037035,
'verbal': 0.005698005698005698,
'underlie': 0.002849002849002849,
'long': 0.11396011396011396,
'import': 0.02564102564102564,
'analysis': 0.014245014245014245,
'tool': 0.10541310541310542,
'application': 0.045584045584045586,
'perhap': 0.014245014245014245,
'snail': 0.03133903133903134,
'review': 0.06837606837606838,
'easiest': 0.04843304843304843,
'extremely': 0.07407407407407407,
'though': 0.05698005698005698,
'verify': 0.045584045584045586,
'leave': 0.16524216524216523,
'virtual': 0.03133903133903134,
'dynamic': 0.011396011396011397,
'recipient': 0.039886039886039885,
'couple': 0.06267806267806268,
'least': 0.1282051282051282,
'germany': 0.019943019943019943,
'useful': 0.03133903133903134,
'client': 0.05413105413105413,
'television': 0.042735042735042736,
'prof': 0.008547008547008548,
'scott': 0.005698005698005698,
'phd': 0.019943019943019943,
'importance': 0.008547008547008548,
'gb': 0.011396011396011397,
'korean': 0.002849002849002849,
'order': 0.3732193732193732,
'variation': 0.002849002849002849,
'aim': 0.008547008547008548,
'trust': 0.039886039886039885,
'thoma': 0.017094017094017096,
'corporations': 0.04843304843304843,
'musical': 0.017094017094017096,
'much': 0.2934472934472934,
'lifetime': 0.04843304843304843,
'intrusion': 0.045584045584045586,
'set': 0.12535612535612536,
'iii': 0.014245014245014245,
'potential': 0.13105413105413105,
'concept': 0.017094017094017096,
'country': 0.11965811965811966,
'hotel': 0.022792022792022793,
'school': 0.04843304843304843,
'master': 0.03418803418803419,
'acquire': 0.06552706552706553,
'laugh': 0.037037037037037035,
'mo': 0.02564102564102564,
'psychological': 0.002849002849002849,
'club': 0.05413105413105413,
'obviously': 0.06837606837606838,
'debt': 0.08831908831908832,
'extract': 0.045584045584045586,
'vision': 0.011396011396011397,
'million': 0.245014245014245,
'acquisition': 0.008547008547008548,
'object': 0.008547008547008548,
'human': 0.022792022792022793,
'sake': 0.037037037037037035,
'include': 0.37037037037037035,
't': 0.21082621082621084,
'success': 0.1737891737891738,
'theoretical': 0.002849002849002849,
'community': 0.022792022792022793,
'vacation': 0.09116809116809117,
'sample': 0.05413105413105413,
'wh': 0.005698005698005698,
'five': 0.04843304843304843,
'finance': 0.042735042735042736,
'search': 0.1737891737891738,
'datum': 0.02849002849002849,
'engine': 0.09971509971509972,
'while': 0.1339031339031339,
'everybody': 0.022792022792022793,
'february': 0.011396011396011397,
'author': 0.022792022792022793,
'editor': 0.005698005698005698,
'summarize': 0.008547008547008548,
'fundamental': 0.002849002849002849,
'publish': 0.06837606837606838,
'blackwell': 0.002849002849002849,
'yourself': 0.20797720797720798,
'hello': 0.10541310541310542,
'investigation': 0.017094017094017096,
'universal': 0.011396011396011397,
'karen': 0.022792022792022793,
'jump': 0.039886039886039885,
'umontreal': 0.002849002849002849,
'follow': 0.33903133903133903,
'professional': 0.09686609686609686,
'emerge': 0.005698005698005698,
'once': 0.19658119658119658,
'day': 0.4415954415954416,
'jame': 0.008547008547008548,
'slip': 0.037037037037037035,
'medical': 0.02849002849002849,
'borrow': 0.037037037037037035,
'song': 0.019943019943019943,
'idea': 0.08547008547008547,
'hour': 0.2678062678062678,
'argument': 0.002849002849002849,
'obvious': 0.008547008547008548,
'listing': 0.014245014245014245,
'november': 0.008547008547008548,
'subscription': 0.017094017094017096,
'shift': 0.011396011396011397,
'pleasure': 0.02564102564102564,
'london': 0.03418803418803419,
'however': 0.07977207977207977,
'fun': 0.1282051282051282,
'tree': 0.022792022792022793,
'man': 0.05128205128205128,
'classroom': 0.002849002849002849,
'dates': 0.002849002849002849,
'mt': 0.008547008547008548,
'accommodation': 0.008547008547008548,
'cm': 0.005698005698005698,
'alternative': 0.011396011396011397,
'send': 0.4415954415954416,
'diversity': 0.002849002849002849,
'framework': 0.002849002849002849,
'bag': 0.022792022792022793,
'cognition': 0.002849002849002849,
'relationship': 0.008547008547008548,
'reasonable': 0.022792022792022793,
'att': 0.017094017094017096,
'exceed': 0.019943019943019943,
'von': 0.002849002849002849,
'during': 0.02564102564102564,
'region': 0.005698005698005698,
'accessible': 0.037037037037037035,
'linguistics': 0.005698005698005698,
'refer': 0.02849002849002849,
'observe': 0.005698005698005698,
'britain': 0.002849002849002849,
'creditor': 0.045584045584045586,
'specify': 0.02564102564102564,
'moneymake': 0.05982905982905983,
'relevance': 0.002849002849002849,
'honor': 0.017094017094017096,
'connection': 0.07977207977207977,
'news': 0.08831908831908832,
'parameter': 0.005698005698005698,
'twelve': 0.03418803418803419,
'cv': 0.005698005698005698,
'mike': 0.014245014245014245,
'parallel': 0.005698005698005698,
'apply': 0.05413105413105413,
'is': 0.23931623931623933,
'here': 0.4188034188034188,
'short': 0.1111111111111111,
'password': 0.02564102564102564,
'mellon': 0.002849002849002849,
'textbook': 0.002849002849002849,
'telephone': 0.09116809116809117,
'morn': 0.019943019943019943,
'paragraph': 0.02849002849002849,
'educational': 0.011396011396011397,
'worth': 0.09401709401709402,
'effective': 0.10826210826210826,
'reflect': 0.005698005698005698,
'normal': 0.011396011396011397,
'compute': 0.011396011396011397,
'ba': 0.02564102564102564,
'pretty': 0.039886039886039885,
'comparative': 0.002849002849002849,
'contribute': 0.011396011396011397,
'latest': 0.10541310541310542,
'minimum': 0.03133903133903134,
'interpretation': 0.002849002849002849,
'australium': 0.011396011396011397,
'drop': 0.07122507122507123,
'tell': 0.20512820512820512,
'fill': 0.15954415954415954,
'richer': 0.042735042735042736,
'compile': 0.02564102564102564,
'residual': 0.03418803418803419,
'convention': 0.008547008547008548,
'belgium': 0.002849002849002849,
'martha': 0.002849002849002849,
'functional': 0.014245014245014245,
'variety': 0.014245014245014245,
'white': 0.008547008547008548,
'spell': 0.005698005698005698,
'interview': 0.05413105413105413,
'isp': 0.05128205128205128,
'cognitive': 0.002849002849002849,
'slavic': 0.002849002849002849,
'u': 0.1396011396011396,
'possibility': 0.019943019943019943,
'speech': 0.005698005698005698,
'hire': 0.017094017094017096,
'spokane': 0.03133903133903134,
'japanese': 0.008547008547008548,
'education': 0.03418803418803419,
'unify': 0.002849002849002849,
'distribute': 0.045584045584045586,
'perception': 0.002849002849002849,
'enquiry': 0.005698005698005698,
'president': 0.022792022792022793,
'poster': 0.005698005698005698,
'african': 0.014245014245014245,
'town': 0.03133903133903134,
'overload': 0.039886039886039885,
'everythe': 0.045584045584045586,
'testimonial': 0.04843304843304843,
'death': 0.03418803418803419,
'christian': 0.005698005698005698,
'purpose': 0.039886039886039885,
'return': 0.18803418803418803,
'team': 0.05982905982905983,
'duration': 0.002849002849002849,
'cinema': 0.02849002849002849,
'lot': 0.150997150997151,
'seem': 0.05413105413105413,
'linguistic': 0.002849002849002849,
'freedom': 0.10541310541310542,
'v': 0.08262108262108261,
'want': 0.41595441595441596,
'cheap': 0.03418803418803419,
'total': 0.13675213675213677,
'dept': 0.022792022792022793,
'literary': 0.002849002849002849,
'second': 0.06837606837606838,
'honest': 0.05128205128205128,
'wife': 0.037037037037037035,
'head': 0.037037037037037035,
'agency': 0.05413105413105413,
'final': 0.039886039886039885,
'assume': 0.06837606837606838,
'berlin': 0.002849002849002849,
'alone': 0.07407407407407407,
'compete': 0.019943019943019943,
'three': 0.07407407407407407,
'attach': 0.039886039886039885,
'montreal': 0.002849002849002849,
'movie': 0.05698005698005698,
'consist': 0.011396011396011397,
're': 0.29914529914529914,
'incredible': 0.05413105413105413,
'lack': 0.022792022792022793,
'mistake': 0.022792022792022793,
'satisfy': 0.05698005698005698,
'fine': 0.03133903133903134,
'middle': 0.037037037037037035,
'ask': 0.19373219373219372,
'capability': 0.017094017094017096,
'guest': 0.011396011396011397,
'discover': 0.09116809116809117,
'resort': 0.02849002849002849,
'amount': 0.17663817663817663,
'define': 0.002849002849002849,
'certainly': 0.03418803418803419,
'problem': 0.13675213675213677,
'organize': 0.008547008547008548,
'believe': 0.1623931623931624,
'resell': 0.06552706552706553,
'survey': 0.017094017094017096,
'whatsoever': 0.039886039886039885,
'lisa': 0.008547008547008548,
'cent': 0.05698005698005698,
'largest': 0.05698005698005698,
'requirement': 0.039886039886039885,
'property': 0.03133903133903134,
'msn': 0.02849002849002849,
'mail': 0.5128205128205128,
'loan': 0.05982905982905983,
'domain': 0.06267806267806268,
'released': 0.03133903133903134,
'pour': 0.008547008547008548,
'sum': 0.008547008547008548,
'complete': 0.15669515669515668,
'syntax': 0.002849002849002849,
'symbol': 0.019943019943019943,
'command': 0.019943019943019943,
'sheffield': 0.002849002849002849,
'interpret': 0.002849002849002849,
'reviewer': 0.008547008547008548,
'merciless': 0.03133903133903134,
'nijmegen': 0.002849002849002849,
'dupe': 0.02564102564102564,
'industrial': 0.008547008547008548,
'phonology': 0.002849002849002849,
'create': 0.1623931623931624,
'pittsburgh': 0.008547008547008548,
'radio': 0.05128205128205128,
'together': 0.05982905982905983,
'sender': 0.06837606837606838,
'privacy': 0.037037037037037035,
'nature': 0.008547008547008548,
'length': 0.011396011396011397,
'contributor': 0.002849002849002849,
'forthcome': 0.014245014245014245,
'bet': 0.042735042735042736,
'genre': 0.002849002849002849,
'product': 0.28205128205128205,
'expiration': 0.06267806267806268,
'indiana': 0.002849002849002849,
'nl': 0.002849002849002849,
'obligation': 0.037037037037037035,
'newsletter': 0.045584045584045586,
'ii': 0.011396011396011397,
'emailer': 0.03418803418803419,
'exclusive': 0.06837606837606838,
'note': 0.17094017094017094,
'maintain': 0.03418803418803419,
'assure': 0.02849002849002849,
'rock': 0.022792022792022793,
'quick': 0.08262108262108261,
'retail': 0.03133903133903134,
'copy': 0.18518518518518517,
'x': 0.1339031339031339,
'ad': 0.14245014245014245,
'light': 0.014245014245014245,
'serve': 0.02564102564102564,
'toy': 0.02849002849002849,
'press': 0.037037037037037035,
'file': 0.15954415954415954,
'days': 0.042735042735042736,
'gold': 0.05982905982905983,
'strength': 0.014245014245014245,
'few': 0.19658119658119658,
'eastern': 0.011396011396011397,
'both': 0.13105413105413105,
'dependency': 0.002849002849002849,
'chomsky': 0.002849002849002849,
'course': 0.1168091168091168,
'morpheme': 0.002849002849002849,
'financially': 0.04843304843304843,
'device': 0.017094017094017096,
'behind': 0.02849002849002849,
'situation': 0.03133903133903134,
'parent': 0.02564102564102564,
'foundation': 0.017094017094017096,
'bit': 0.05698005698005698,
'orient': 0.008547008547008548,
'stealth': 0.06552706552706553,
'remember': 0.13105413105413105,
'vium': 0.13105413105413105,
'personally': 0.017094017094017096,
'difficult': 0.037037037037037035,
'money': 0.4017094017094017,
'jp': 0.005698005698005698,
'institute': 0.011396011396011397,
'disc': 0.017094017094017096,
'lyric': 0.017094017094017096,
'correspondence': 0.008547008547008548,
'cancel': 0.02564102564102564,
'probably': 0.07407407407407407,
'asset': 0.042735042735042736,
'prompt': 0.04843304843304843,
'equipment': 0.017094017094017096,
'feature': 0.06552706552706553,
'id': 0.06267806267806268,
'conclude': 0.037037037037037035,
'need': 0.41595441595441596,
'sale': 0.16524216524216523,
'truth': 0.02564102564102564,
'automatically': 0.09116809116809117,
'mix': 0.037037037037037035,
'spend': 0.1396011396011396,
'sorry': 0.06837606837606838,
'index': 0.05698005698005698,
'art': 0.03133903133903134,
'fact': 0.12535612535612536,
'department': 0.02564102564102564,
'common': 0.042735042735042736,
'estate': 0.039886039886039885,
'publisher': 0.019943019943019943,
'basically': 0.04843304843304843,
'conjunction': 0.002849002849002849,
'ram': 0.022792022792022793,
'arrange': 0.014245014245014245,
'whether': 0.05982905982905983,
'example': 0.10256410256410256,
'sure': 0.18803418803418803,
'plenary': 0.002849002849002849,
'love': 0.1111111111111111,
'preparation': 0.014245014245014245,
'actual': 0.02849002849002849,
'cameraready': 0.002849002849002849,
'cost': 0.2849002849002849,
'point': 0.08831908831908832,
'experience': 0.18233618233618235,
'quickly': 0.08547008547008547,
'thousands': 0.03133903133903134,
'ultimate': 0.03133903133903134,
'browser': 0.02564102564102564,
'started': 0.03418803418803419,
'reg': 0.03418803418803419,
'coordinate': 0.002849002849002849,
'rush': 0.042735042735042736,
'network': 0.06552706552706553,
'indefinite': 0.002849002849002849,
'delete': 0.11396011396011396,
'mailer': 0.045584045584045586,
'package': 0.1396011396011396,
'instead': 0.05128205128205128,
'side': 0.011396011396011397,
'already': 0.14814814814814814,
'essential': 0.03133903133903134,
'ever': 0.22507122507122507,
'responsibility': 0.019943019943019943,
'mit': 0.002849002849002849,
'brief': 0.02564102564102564,
'desire': 0.06267806267806268,
'discovery': 0.019943019943019943,
'royal': 0.011396011396011397,
'update': 0.08262108262108261,
'intelligence': 0.045584045584045586,
'control': 0.08831908831908832,
'present': 0.07977207977207977,
'round': 0.011396011396011397,
'sponsor': 0.022792022792022793,
'occur': 0.017094017094017096,
'philosophy': 0.005698005698005698,
'current': 0.06267806267806268,
'web': 0.2678062678062678,
'text': 0.09116809116809117,
'broadcast': 0.017094017094017096,
'spot': 0.022792022792022793,
'effect': 0.017094017094017096,
'tradition': 0.002849002849002849,
'tv': 0.05413105413105413,
'germanic': 0.002849002849002849,
...}
According to our computation, the probability that a spam email contains the word 'consonant' is about $0.28\%$, while the probability that this word occurs in a ham email is about $2.56\%$.
In [17]:
Spam_Probability['consonant'], Ham__Probability['consonant']
Out[17]:
(0.002849002849002849, 0.02564102564102564)
For the word 'dollar', the probability that a spam email contains this word is about $21.1\%$, while the probability that this word occurs in a ham email is about $1.99\%$.
In [18]:
Spam_Probability['dollar'], Ham__Probability['dollar']
Out[18]:
(0.21082621082621084, 0.019943019943019943)
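To see how strongly a single word can shift the classification, Bayes' theorem can be applied to one word in isolation. The following sketch is not part of the original notebook: the helper name `posterior_spam_given_word` is hypothetical, the probabilities are copied from the two outputs above, and the priors default to the value $0.5$ computed earlier.

```python
def posterior_spam_given_word(p_word_spam, p_word_ham, spam_prior=0.5, ham_prior=0.5):
    """Return P(spam | word occurs), using Bayes' theorem for a single word."""
    joint_spam = p_word_spam * spam_prior  # proportional to P(spam and word)
    joint_ham = p_word_ham * ham_prior     # proportional to P(ham and word)
    return joint_spam / (joint_spam + joint_ham)

# Values copied from Out[17] and Out[18].
p_dollar = posterior_spam_given_word(0.21082621082621084, 0.019943019943019943)
p_consonant = posterior_spam_given_word(0.002849002849002849, 0.02564102564102564)
print(p_dollar, p_consonant)
```

If only this one word were known, an email containing 'dollar' would be spam with a probability of about $91\%$, while an email containing 'consonant' would be spam with a probability of about $10\%$.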
Given a file name `fn`, the function `spam_probability` returns the probability that the message contained in the given file is spam.
When implementing the formula
$$\arg\max\limits_{C \in \mathcal{C}} \left(\prod\limits_{i=1}^m P(f_i \;|\; C)\right) \cdot P(C) $$
we have to be careful, because a naive implementation will evaluate the product
$$\prod\limits_{i=1}^m P(f_i \;|\; C)$$
as the number $0$ due to numerical underflow. The trick to compute this product is to remember that
$$ \ln(a \cdot b) = \ln(a) + \ln(b) $$
and therefore transform the product into a sum of logarithms:
$$ \prod\limits_{i=1}^m P(f_i \;|\; C) = \exp\left(\alpha + \sum\limits_{i=1}^m \ln\bigl(P(f_i \;|\; C)\bigr) \right) \cdot \exp(-\alpha)$$
Here, the constant $\alpha$ has to be chosen such that the application of the function `exp` to the value
$$ \alpha + \sum\limits_{i=1}^m \ln\bigl(P(f_i \;|\; C)\bigr) $$
does not lead to an underflow error.
As we want to compute a probability, we have to be aware that the term $$ \left(\prod\limits_{i=1}^m P(f_i \;|\; C)\right) \cdot P(C) $$ is not the probability that the object is of class $C$ but is only proportional to this probability. Since the probability that an email is spam and the probability that it is ham must add up to $1$, we obtain the actual probabilities by dividing each of the two terms by their sum. The common factor $\exp(-\alpha)$ cancels in this division, so it never has to be computed.
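The underflow and the effect of the shift by $\alpha$ can be reproduced in isolation. The sketch below uses made-up probabilities (400 hypothetical features with constant values) rather than the email data, and it assumes Python 3.8 or later for `math.prod`.

```python
import math

# Hypothetical data: 400 features whose conditional probabilities are
# slightly larger for ham than for spam. These numbers are made up.
probs_spam = [0.010] * 400
probs_ham = [0.011] * 400
prior = 0.5  # same prior for both classes

# Naive approach: both products underflow to exactly 0.0, so the
# normalization would have to divide 0.0 by 0.0.
naive_spam = math.prod(probs_spam) * prior
naive_ham = math.prod(probs_ham) * prior

# Log-space approach: sum the logarithms instead of multiplying.
log_spam = sum(math.log(p) for p in probs_spam) + math.log(prior)
log_ham = sum(math.log(p) for p in probs_ham) + math.log(prior)

# Choose alpha so that the larger of the two log scores is shifted to 0.
alpha = -max(log_spam, log_ham)
score_spam = math.exp(log_spam + alpha)
score_ham = math.exp(log_ham + alpha)

# The common factor exp(-alpha) cancels when normalizing.
p_spam = score_spam / (score_spam + score_ham)
p_ham = score_ham / (score_spam + score_ham)
print(naive_spam, naive_ham)
print(p_spam, p_ham)
```

The naive products are both `0.0`, whereas the shifted version still yields a tiny but nonzero spam probability and a ham probability close to $1$.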
In [19]:
def spam_probability(fn):
    log_p_spam = 0.0
    log_p_ham  = 0.0
    words = get_common_words(fn)
    # Sum the log-likelihoods over all common words, whether present or absent.
    for w in Common_Words:
        if w in words:
            log_p_spam += math.log(Spam_Probability[w])
            log_p_ham  += math.log(Ham__Probability[w])
        else:
            log_p_spam += math.log(1.0 - Spam_Probability[w])
            log_p_ham  += math.log(1.0 - Ham__Probability[w])
    # Both sums are <= 0, so alpha shifts the larger one to 0 before calling exp.
    alpha  = abs(max(log_p_spam, log_p_ham))
    p_spam = math.exp(log_p_spam + alpha) * spam_prior
    p_ham  = math.exp(log_p_ham  + alpha) * ham__prior
    # Normalize; the common factor exp(-alpha) cancels.
    return p_spam / (p_spam + p_ham)
Let us test this with a ham email.
In [20]:
spam_probability('EmailData/ham-train/3-430msg1.txt')
Out[20]:
6.289803980920058e-29
Ok, we got this one right. Let us check the general performance.
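In order to evaluate the performance of this algorithm, we need to define two new concepts: precision and recall. Let us call the ham emails the positives, while the spam emails are called the negatives. Then a true positive is a ham email that is classified as ham, a false positive is a spam email that is classified as ham, a true negative is a spam email that is classified as spam, and a false negative is a ham email that is classified as spam.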
The precision of the spam classifier is then defined as $$ \texttt{precision} = \frac{\mbox{number of true positives}}{\mbox{number of true positives} + \mbox{number of false positives}} $$ Therefore, the precision is the fraction of the emails classified as ham that really are ham. The recall of the spam classifier is defined as $$ \texttt{recall} = \frac{\mbox{number of true positives}}{\mbox{number of true positives} + \mbox{number of false negatives}} $$ Therefore, the recall is the fraction of all ham emails that are indeed classified as ham.
Usually, it is very important that the recall is high, as we don't want to lose a ham email because our classifier has incorrectly classified it as a spam email.
On the other hand, having a high precision is not that important. After all, if $10\%$ of the emails offered to us as ham are, in fact, spam, we might tolerate this. However, we would certainly not tolerate losing $10\%$ of our ham emails because they are incorrectly classified as spam.
The function `precission_recall` takes two directories as arguments: `spam_dir` is supposed to contain spam emails, while `ham_dir` contains ham emails. It computes the precision, the recall, and the accuracy of our spam classifier with respect to these test data.
In [21]:
def precission_recall(spam_dir, ham_dir):
    TN = 0 # true negatives
    FP = 0 # false positives
    for email in os.listdir(spam_dir):
        if spam_probability(spam_dir + email) > 0.5:
            TN += 1
        else:
            FP += 1
    FN = 0 # false negatives
    TP = 0 # true positives
    for email in os.listdir(ham_dir):
        if spam_probability(ham_dir + email) > 0.5:
            FN += 1
        else:
            TP += 1
    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    accuracy  = (TN + TP) / (TN + TP + FN + FP)
    return precision, recall, accuracy
In [22]:
precission_recall(spam_dir_train, ham__dir_train)
Out[22]:
(0.8495145631067961, 1.0, 0.9114285714285715)
In [23]:
precission_recall(spam_dir_test, ham__dir_test)
Out[23]:
(0.7791411042944786, 0.9769230769230769, 0.85)
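The three numbers for the test set can be traced back to concrete counts. The notebook does not print these counts; the values below are inferred from the reported ratios under the assumption, stated at the beginning, that the test set contains 130 spam and 130 ham emails. The helper `metrics` is a hypothetical name used only for this illustration.

```python
def metrics(TP, FP, TN, FN):
    """Compute precision, recall and accuracy from the four confusion counts."""
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    accuracy = (TP + TN) / (TP + FP + TN + FN)
    return precision, recall, accuracy

# Inferred counts (assumption: 130 spam and 130 ham test emails):
# 127 ham emails kept, 3 ham emails lost, 36 spam emails let through,
# 94 spam emails caught.
precision, recall, accuracy = metrics(TP=127, FP=36, TN=94, FN=3)
print(precision, recall, accuracy)
```

Under this assumption, only 3 of the 130 ham emails are misclassified as spam, which is the kind of error we care most about, while 36 of the 130 spam emails still pass the filter.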